Preface


The Fifth Distributed Memory Computing Conference (DMCC5) was held April
8-12, 1990, at The Omni Hotel, Charleston, South Carolina, and was hosted by
the University of South Carolina. Four invited talks and 99 contributed talks were
presented, with 12 more papers making up the two mini-symposia. In addition,
approximately 100 posters were presented. This two-volume set includes papers
from all four of these categories. Volume 1 covers applications, and Volume 2 deals
with all other areas, including hardware, software tools, performance, languages,
and so on.

DMCC5 continues the conference series, previously known as the “Hypercube”
or “HCCA” conference, that originated in 1985 at the Oak Ridge National
Laboratory (ORNL). The first two conferences were hosted by ORNL in Knoxville, TN,
and focused almost exclusively on the hypercube concurrent computer. The scope
of the third and fourth conferences, hosted respectively by Caltech’s Jet
Propulsion Laboratory and Sandia National Laboratories, was broadened to include other
types of distributed memory computers. With DMCC5 this trend has continued,
and as the new name indicates, the conference series now embraces all aspects of
distributed memory computing.

The DMCC5 conference theme was “Education”, which we believe is essential
in encouraging the effective use of distributed memory computers. This theme was
promoted by half-day tutorials, student conference awards, and a student paper
competition. A grant from the National Science Foundation provided funds for 26
student attendees, many of whom might otherwise have been unable to participate
in the conference. The Student Paper Competition (for papers authored solely by
students) generated several good entries. IBM Corporation generously sponsored
three first prizes of $500 each. Three runner-up prizes were sponsored by the College
of Science and Mathematics of the University of South Carolina, and an additional
runner-up prize was donated by the Caltech Concurrent Computation Program.
Prize winners are listed on the following page. We are grateful to the NSF, and the
sponsors of the Student Paper Competition, for their support of student
participation in DMCC5.

A  large  number  of  persons  and  organizations  have  contributed  to  the  success  of 
DMCC5.  We  are  particularly  grateful  to  the  conference  sponsors: 

Air  Force  Office  of  Scientific  Research 
Defense  Advanced  Research  Projects  Agency,  ISTO 
Joint  Tactical  Fusion  Program  Office 
NASA  Ames  Research  Laboratory 
Sandia  National  Laboratories 
Strategic  Defense  Initiative  Organization/OIST 
U.S.  Air  Force,  Electronic  Systems  Division 

We would also like to thank the members of the Organizing and Program
Committees for ensuring the smooth running of the conference. Also essential to the
conference organization were those who gave their time and expertise to serve as
session chairs, reviewers, and participants in the panel discussion. Finally, we
are grateful for the support of the DMCC5 host institution, the University of South
Carolina.

David  W.  Walker 
Quentin  F.  Stout 


Student Paper Competition Awards


First  Prize  (Hardware) 

Philip R. Miller and Jelio T. Yantchev, Department of Electronics and Computer
Science, University of Southampton, UK, “Developing Powerful Communication
Mechanisms for Distributed Memory Computers from Simple and Efficient Message
Routing.”

First  Prize  (Algorithms  and  Applications) 

Stefan Vandewalle, Department of Computer Science, Katholieke Universiteit
Leuven, Belgium, “Waveform Relaxation Methods for Solving Parabolic Partial
Differential Equations.”

First  Prize  (Operating  Systems  and  Software) 

Anthony Skjellum and Alvin P. Leung, Department of Chemical Engineering,
California Institute of Technology, “Zipcode: A Portable Multicomputer
Communication Library Atop the Reactive Kernel.”

Runner-Up  Prizes 

Anne  C.  Elster,  School  of  Electrical  Engineering,  Cornell  University,  “Basic  Matrix 
Subprograms  for  Distributed  Memory  Systems.” 

Arjun  Khanna,  Department  of  Electrical  and  Computer  Engineering,  University  of 
Texas  at  Austin,  “On  Managing  Classes  in  a  Distributed  Object-Oriented  Operating 
System.” 

Silvia  M.  Muller,  Department  of  Computer  Science,  University  of  Saarland,  West 
Germany,  “A  Method  to  Parallelize  Tridiagonal  Solvers.” 

Anthony Skjellum and Alvin P. Leung, Department of Chemical Engineering,
California Institute of Technology, “LU Factorization of Sparse, Unsymmetric Jacobian
Matrices on Multicomputers: Experience, Strategies, Performance.”

The Fifth Distributed Memory Computing Conference


Hypercubes  for  Critical  Space  Flight  Command  Operations 

J.  C.  Horvath,  T.  Tang,  L.  P.  Perry,  R.  C.  Cole 
Mission  Profile  and  Sequencing  Section 
D.  B.  Olster  and  J.  E.  Zipse 

Flight  Command  and  Data  Management  Systems  Section 
Jet Propulsion Laboratory, California Institute of Technology
4800 Oak Grove Drive, Pasadena, CA 91109

Abstract

Controlling interplanetary spacecraft and planning their
activities, as currently practiced, requires massive
amounts of computer time and personnel. To improve this
situation, it is desired to use advanced computing to speed
up and automate the commanding process. Several design
and prototype efforts have been underway at JPL to
understand the appropriate roles for concurrent processors
in future interplanetary spacecraft operations. Here we
report on an effort to identify likely candidates for
parallelism among existing software systems that both
generate commands to be sent to the spacecraft and
simulate what the spacecraft will do with these commands
when it receives them. We also describe promising
results from efforts to create parallel prototypes of
representative portions of these software systems on the
JPL/Caltech Mark III hypercube.

I. Background

Controlling interplanetary spacecraft and planning their
activities, as currently practiced, requires massive
amounts of computer time and personnel [1,2]. As
missions become longer and more ambitious, and as
budgets become tighter, it is desirable to have more
autonomous spacecraft, or, at least, more automated ways
of controlling spacecraft from the ground. As spacecraft
become more autonomous their onboard computers
become correspondingly more complex, and simulating
them on the ground to predict and plan their actions
becomes more difficult. It is further desired that these new
software systems be "multi-mission" and able to support
more than one complex spacecraft. Hence, larger ground
computers and more sophisticated ground software are
required to support these more advanced spacecraft.
Several design and prototype efforts have been underway at
JPL to understand the appropriate roles for concurrent
processors in future interplanetary spacecraft operations.

The prototypes that will be described in this paper were
built on the JPL/Caltech Mark III hypercube [3], which is
a 68020-based hypercube topology distributed memory
parallel processor. The Mark III we have been using has 4
Mb per node, with no shared memory. Two 68020s are
used on each node, one of which is dedicated to message
routing, the other of which is used for data processing.
Distributed-memory machines of topologies other than
hypercubes have not been examined for this application at
this time, and will not be discussed further in this paper.
Other architectures are, however, being considered as well
for flight versions.

Software systems have been built to support planetary
missions over the years that both generate commands to
be sent to deep-space spacecraft (e.g., the "SEQGEN"
program) and simulate what the spacecraft will do with
these commands when it receives them. These software
systems have grown up around the architecture of a
Unisys 1100 mainframe (including substantial
implementation in assembly), and hence it will require
some thought and planning to accurately port this system
to distributed-memory parallel processors in a way that
successfully exploits this architecture. Along with porting
the software to a parallel machine, studies are being done
to understand what new capabilities and system
architectures are enabled by using a more capable machine,
particularly a parallel one. The current mainframe
sequencing environment is a heavily batch-oriented one,
and a culture of spacecraft command review has grown up
around the files that go into these existing programs.
Hence moving to a parallel machine and building more
powerful software may enable some cultural changes (and
manpower savings) in the way command loads are built
for spacecraft. More powerful software might also be
more general and more able to be used for multiple
missions, instead of largely rebuilt for each new mission.

Figure 1A shows the generic process of commanding
planetary spacecraft commonly referred to as the "uplink"
process. Figure 1B shows the current top-level design of
our sequencing and simulation system, which is
implemented as subcubes of a Mark III hypercube. A user
employs an editor of some sophistication and consults
with experts on the various spacecraft subsystems to
determine a plausible set of commands. A time-ordered
file is created with these commands in it. These
commands are fed to the expander program. When the
expander program gets the file, it expands the first
time-slice of the file, and then sends the expanded output
to the checker program and the simulator program. The
checker program checks the output against high-level
constraints ("Don't turn the star scanner within ten degrees
of the sun."). The simulator program predicts what the
spacecraft will actually do, at either the functional or bit
level, with these commands. While the checker and
simulator are processing the first time slice of commands,
the second set is expanded. For future high-fidelity
bit-level simulators, however, it may be necessary to
expand all commands before simulating them, since
memory management issues may come into play. Then it
may make more sense to multitask and first use all the
processors of the hypercube to perform expansion, and
then use half of them for simulation and half for checking.


Simultaneous with this system-level design (the first
incarnation of which was described in [4]) we are
prototyping some of the characteristic functions of these
codes in parallel to see which portions benefit the most,
and which would perhaps be adequate in sequential mode.

II. Parallel Command Generation

SEQGEN has several main parts. The core routines
alluded to above are known as "expander" and "checker."
Expander takes in high-level spacecraft "profile activities"
(PAs) and outputs a time-sorted list of lower-level
commands that have been "expanded" from the input
PAs. A typical PA might be "take a series of pictures of
Venus." A typical low-level command that might occur in
that sequence might be "turn toward Venus." In flight
SEQGEN, the code that tells expander which commands
to generate from a given PA is written in "Sequence
Component Development Language" (SCDL). SCDL is
a vaguely FORTRAN-like "little language" which is then
translated into PL/1 code. It is this PL/1 code which
actually runs on the Univac and generates expansions
from input files of PAs. One could write the code
describing how to expand PAs directly in PL/1 (or C);
however, since many expansion functions are used over and
over, the application-tailored SCDL is more efficient for
developing actual flight expansions. Note, however, that
the translation of SCDL into the language that actually
runs on the target machine is performed only rarely, and
thus this translation step is not really important in our
hypercube timing and feasibility design; indeed, it was
avoided for the first prototype, as will be described.

The parallel version of expander described here avoided the
SCDL step and had hard-wired C code that parsed various
"flight-like" PAs. Each of the N nodes of the hypercube
expanded to completion 1/N of the input PAs. The only
inefficiencies were thus load imbalance among the nodes
of the hypercube (if some PAs were more complex to
expand than others) and sorting inefficiencies when the
PAs expanded on different nodes were interleaved into one
master file on all nodes and sorted in time order on all
nodes. It became apparent that sorting was a bottleneck,
and so we wrote an efficient parallel sort to decrease these
inefficiencies.
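
The partitioning just described can be illustrated with a short sketch. It is written in Python rather than the C of the prototype, and everything in it is an assumption made for illustration: the `expand_pa` function, the (start time, name, command count) form of a PA, the round-robin assignment of PAs to nodes, and the use of a merge in place of the prototype's parallel sort. It shows only the shape of the computation, in which each node expands its share independently and the partial results are then combined in time order.

```python
import heapq

def expand_pa(pa):
    """Hypothetical expansion: a PA is (start_time, name, n_commands)."""
    start, name, count = pa
    # Emit `count` low-level commands spaced one time unit apart.
    return [(start + i, f"{name}-cmd{i}") for i in range(count)]

def expand_on_nodes(pas, n_nodes):
    """Expand PAs as N independent nodes would, then merge in time order."""
    per_node = []
    for node in range(n_nodes):
        local = []
        # Round-robin assignment: node k takes PAs k, k+N, k+2N, ...
        for pa in pas[node::n_nodes]:
            local.extend(expand_pa(pa))
        local.sort()                      # each node sorts its own output
        per_node.append(local)
    # Combine the sorted per-node lists into one time-ordered master list.
    return list(heapq.merge(*per_node))

pas = [(0, "image", 3), (1, "turn", 2), (5, "downlink", 2), (2, "cal", 1)]
merged = expand_on_nodes(pas, 2)
```

In this sketch a node that happens to receive PAs with many commands does more work than its neighbors, which is the load imbalance noted above.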

This program is really a cross-compiler for the onboard
computer. We have come to the conclusion that front-end
compilation (without optimization) is naturally highly
parallel, since each command in the input file is checked
semantically and parsed without any interaction with other
commands. Hence the incoming command file can be
split over hypercube processors with no interaction
required among the processors.

III. Integration of expander and sort

The parallel version of expander with hard-wired
expansions has been integrated with a high-speed sort [5].
This sort improved the run time substantially, as shown in Table 1.
These are all relatively small cases, so load-balancing
effects dominate after four nodes. Here, the efficiency, E,
is defined as:

$$E = \frac{\text{Time to run on one node with sequential sort}}{N \times \text{Time to run on } N \text{ nodes}} \qquad (1)$$

Table 1. Times and efficiencies for expanding and sorting
(times do not include I/O)

Nodes    Sequential sort        Parallel sort
         time (s)    E (%)      time (s)    E (%)

  1       20.94       100        23.81       88
  2       13.30        79        10.60       99
  4       10.02        52         6.67       78
  8        7.58        35         4.91       53

The parallel sort has some overhead which slows down its
one-node version relative to the one-node non-parallel sort
version. Hence, parallel efficiencies for both sorts are
quoted relative to the one-node non-parallel sort
case. This overhead is such that the code gives the
appearance of running more than twice as fast on two
nodes. This is a peculiarity of the sort implementation.
From these numbers, it is apparent that a large portion of
the inefficiency was indeed due to the non-parallel sort.
Remaining inefficiencies are mostly of the load-balancing
variety and will improve when we use large, flight-like
sequences with more complex expansions than our small
test demonstrations, and do not restrain ourselves to a test
case that will fit in one node's worth of memory (4 Mb in
the Mark III case).
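
As a check on the arithmetic, the efficiencies in Table 1 can be recomputed from the tabulated run times using Equation (1). The short Python sketch below does this; the variable and function names are ours, and the reference time is taken to be the one-node run with the sequential sort, as the text states.

```python
# Run times in seconds from Table 1 (I/O excluded).
sequential_sort = {1: 20.94, 2: 13.30, 4: 10.02, 8: 7.58}
parallel_sort = {1: 23.81, 2: 10.60, 4: 6.67, 8: 4.91}

def efficiency(t_ref, n_nodes, t_n):
    """Equation (1): reference one-node time over N times the N-node time."""
    return t_ref / (n_nodes * t_n)

t_ref = sequential_sort[1]  # one-node run with the sequential sort
for n in (1, 2, 4, 8):
    e_seq = efficiency(t_ref, n, sequential_sort[n])
    e_par = efficiency(t_ref, n, parallel_sort[n])
    print(f"{n} nodes: sequential {e_seq:.0%}, parallel {e_par:.0%}")
```

The same numbers show why the parallel sort appears to run more than twice as fast on two nodes: 23.81 s divided by 10.60 s is a factor of about 2.25, because the one-node parallel run carries overhead that the one-node sequential run does not.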

Since  it  is  envisioned  that  eventually  SEQGEN  and  a 
simulator might run on the same cube at the same time,
the next step was to modify expander to run on just a
subcube of the hypercube, instead of occupying the whole
cube  in  single-user  mode.  This  generalization  was 
completed  late  last  year.  It  is  important  that  we  be  able  to 
run  on  various  parallel  machines  without  massive 
rewrites;  this  generalization  makes  the  code  more  portable. 

As noted above, the translation from SCDL (in which
expansions are designed) to code which can run on the
host computer is not done too frequently. However, if we
directly hard-wire expansions we will need to individually
code each PA expansion. Instead, we are building a
simplified SCDL interpreter that produces C code that
could run on either the Mark III or a transputer. This
SCDL interpreter will initially interpret only the subset of
commands required to interpret an expansion of the
Galileo profile activity which performs a spacecraft
maneuver, but as time and resources become available the
interpreter could grow up to full flight capability, and use
the preexisting PA definitions that were built in SCDL
for the Unisys. This will let us know if we are scalable
to full flight capability.

Since  all  expansion  of  given  PAs  is  done  fully 
independently  and  in  parallel,  and  since  all  nodes  have 
knowledge  of  all  the  input  PAs  (although  not  their 
expansions)  it  is  not  necessary  to  add  any  parallel  features 
to  the  expander  C  code  generated  by  the  SCDL  interpreter 
and  hence  this  portion  of  the  code  will  port  nearly 
automatically  to  transputer  or  Mark  III  hypercube.  The 
load  balancing  and  sorting  portions  of  the  code  will  need 
to  be  adapted  slightly,  however,  should  we  port  the  current 
Mark III expander to a transputer or other machine. Note
that  distributed  memory  is  not  a  disadvantage  with  this 
application  and  may  indeed  be  an  advantage  in  this  portion 
of  the  code. 


IV.  Command  Checking  in  Parallel 

A  design  was  developed  for  a  parallel  implementation  of 
SEQGEN checker, which takes the low-level commands
generated  by  the  SEQGEN  expander  software  described 
above  and  checks  them  against  a  matrix  of  constraints.  A 
design  study  was  performed  to  understand  the  best 
mapping  of  this  problem  to  the  hypercube,  and  both  Time 
Warp  and  Chandy-Misra  implementations  were  considered. 
A  recommendation  was  made  to  use  Chandy-Misra 
algorithms  for  a  future  prototype,  since  this  paradigm 
lends  itself  better  to  the  network-style  interactions  found 
in  Checker. 

Checker takes the list of low-level commands generated by
expander and checks the commands against a matrix of
constraints. The constraints are also specified by a flight
project in another "little language", this time a vaguely
LISP-like one. A parser changes constraints written in
this language into code that can be used by the host
computer (PL/1 in the case of the Univac flight version).
Hence, the translation into C code can be done once and
the checking network will remain the same for long
periods of time. This will allow conventional
optimization techniques to be used to map the constraint
network to the hypercube nodes in the most efficient
possible way.

IV.A. Sequential Discrete Event Simulation

Checker  can  be  thought  of  as  a  discrete  event  simulator; 
it  takes  each  command  to  be  executed  at  a  discrete  point  in 
time  and  models  the  spacecraft  functions  driven  by  that 
command.  All  discrete  event  simulators  operate  on 
sequentially  ordered  event  lists  [6].  In  this  case,  the 
command  sequence  is  the  event  list  and  the  command 
execution  time  is  the  simulation  time.  The  simulator 
takes  the  event  of  lowest  simulation  time  and  removes  it 
from  the  list.  It  performs  all  computations  defined  for 
that  event  and  effects  changes  on  the  system  being 
simulated  (here,  a  spacecraft).  This  event  may  generate 
other  events  for  future  simulation  which  are  placed  on  the 
event  list  in  time  order.  The  simulator  then  moves  to  the 
next  event  and  updates  the  simulation  time. 
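
The event loop described above can be sketched in a few lines. This is a generic illustration in Python, not SEQGEN code; the command names, the `handlers` table, and the convention that a handler returns (delay, name) pairs for follow-on events are all assumptions made for the example.

```python
import heapq

def run_sequential(events, handlers):
    """Process (time, name) events in time order and return the trace."""
    queue = list(events)
    heapq.heapify(queue)                   # time-ordered event list
    now, trace = 0.0, []
    while queue:
        time, name = heapq.heappop(queue)  # event of lowest simulation time
        now = time                         # advance the simulation time
        trace.append((now, name))
        # A handler may schedule follow-on events as (delay, name) pairs.
        handler = handlers.get(name, lambda t: [])
        for delay, new_name in handler(now):
            heapq.heappush(queue, (now + delay, new_name))
    return trace

# Hypothetical commands: a turn schedules a "settled" event 2 units later.
handlers = {"TURN": lambda t: [(2.0, "SETTLED")]}
trace = run_sequential([(5.0, "SHUTTER"), (1.0, "TURN")], handlers)
```

The single shared queue is what makes this version inherently sequential: every event, including those generated along the way, passes through one time-ordered list.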

Since the internal spacecraft structure consists of
electronic circuits, Checker functions to a large extent as a
logic circuit analyzer [7]. A circuit node is defined using
a constraint network, the set of functions that simulates
the state of the spacecraft. The command is physically an
electronic pulse and its state is set to true. This triggers
the constraint network functions that update the circuit's
state, and verify rules and constraints. This information
may be used as input for all circuit components (or nodes)
that the command affects. Then the state of that command
is set to false and the process is repeated until all nodes have
been analyzed. Some nodes may call models which
simulate spacecraft operations. Models take the form of
commands local to SEQGEN and are placed on a
time-ordered event list. These commands are checked in
the same manner as sequence commands. When
necessary, the constraint network writes error or status
messages to SEQGEN's output files.

Checker's  similarity  to  logic  circuit  simulators  makes  it 
especially  adaptable  to  a  parallel  implementation. 
Constraints  in  separate  sections  of  the  circuit  need  not  be 
checked  sequentially  since  they  do  not  affect  each  other 
immediately  (although  they  may  interconnect  elsewhere). 
Parallel  logic  circuit  simulation  has  already  been 
implemented  successfully  on  several  multiprocessors, 
such as the Intel iPSC and Ametek Series 2010, and
should  demonstrate  similar  speedups  on  the  hypercube  [8]. 

IV.B. Parallel Discrete Event Simulation Paradigms

One way to parallelize SEQGEN checker would be to
divide sequential procedures onto separate processors.
However, finding the inherent parallelism in simulation
code is difficult and usually inefficient. In this case, it is
not at all feasible since the JPL hypercube does not
support the PL/1 or Univac assembly code in which
SEQGEN was originally written. The Chandy-Misra and
Time Warp algorithms, on the other hand, take advantage
of the parallelism inherent in the system being simulated.
Logic circuits, for instance, can be partitioned very easily
even to the point where each processor simulates one
circuit node.

In parallel discrete event simulation, the repeated
manipulation of a sequential event list would inhibit
concurrency [9]. Therefore, events are divided among the
processors. If one event affects another it sends a
timestamped message along the communication links.
The processor receiving this message can accept it
immediately or place it in an input queue. For the
parallel simulation to be consistent with its sequential
counterpart, the system must be deterministic, which
means that the current state of the system uniquely
determines the next state. Otherwise event messages may
not cause the correct event to be simulated, or the
simulation halts when a processor is unable to make a
decision. The simulation proceeds by synchronizing
events. Time Warp and Chandy-Misra have different
synchronization schemes which allow speedups only when
the particular application meets certain criteria. Which
algorithm works better is usually application dependent.

IV.B.1.  Time  Warp 

The  Time  Warp  mechanism  is  operational  on  the 
JPL/Caltech  Mark  III  hypercube  and  was  therefore 
seriously  considered  for  implementing  spacecraft 
constraint  checking.  In  Time  Warp,  events  are  distributed 
among  the  processors  and  communicate  with  timestamped 
messages  [10].  Each  processor  stores  messages  in  an 
input  queue  sorted  by  time.  The  message  with  the 
smallest timestamp is simulated and the Local Virtual
Time (LVT) is advanced to that timestamp. At any
point,  one  processor  may  have  an  LVT  ahead  of  the 
others,  but  this  does  not  affect  the  simulation  except  to 
determine  the  timestamp  of  a  message.  A  process  does 
not  concern  itself  with  the  possibility  of  receiving  future 
events;  this  is  considered  an  optimistic  approach.  The 
state  of  each  event  is  saved  in  case  a  processor  should 
receive  an  event  with  a  timestamp  less  than  the  LVT.  If 
it  does,  the  simulation  rolls  back  to  the  event  just 
previous  to  that  timestamp  and  resimulates.  For  example, 
if  a  message  with  the  timestamp  9.5  is  simulated  and  then 
the  message  with  timestamp  8.2  reaches  the  input  queue 
next,  a  rollback  will  occur. 

After rollback, messages that had been previously sent
with timestamps later than the rollback timestamp are
obsolete and must be removed. The processor sends an
anti-message which contains the same information as the
original message, but will cause both to be removed from
an input queue. If the original message has already been
simulated, or hasn't arrived yet, the anti-message is
processed. Upon discovering a duplicate event in the state
queue, an event will trigger another rollback which then,
in turn, may cause more rollbacks. How, then, does the
simulation progress? Each processor also keeps track of
the Global Virtual Time (GVT), which is the lowest
timestamp of any message or event in the system. No
processor can roll back prior to the GVT, and all event
states with timestamps less than the GVT can be
removed. Therefore, Time Warp is guaranteed to progress.
Time Warp terminates only when all events have been
simulated and all messages and anti-messages processed.


IV.B.2.  Chandy-Misra 

The Chandy-Misra simulation proceeds conservatively. A
processor does not begin computation until it is sure that
it will not receive any messages with a timestamp less
than its local simulation time [9]. Therefore, the
simulation does not require a global time clock or more
memory than necessary for the sequential simulation.
Events, in this case called logical processes (LPs), are
distributed among the processors. The LP
interdependencies are mapped so that a processor will send
and accept messages from certain processors with as many
input and output queues. The LP waits until event
messages are received from all input links. In this way, it
ensures that no future messages will be received with
timestamps less than those of the inputs. The event of
minimum timestamp is simulated and the local
simulation time updated to this timestamp. Then the LP
will wait to output a message until the receiving LP
expects it. All future messages will have a greater
timestamp.

The  simulation  may  deadlock  if  a  processor  waits  to  send 
or  receive  messages  but  is  blocked.  Deadlock  can  be 
prevented  using  null  messages  which  serve  to  update  the 
simulation  time  and  free  waiting  processes.  However,  if 
the  computation  time  of  an  event  exceeds  the  time  for 
another  process  to  produce  a  null  message,  the  system 
may  become  overloaded  with  null  messages.  Though  this 
cannot  be  avoided  if  there  are  too  many  long 
computations, some null messages may be delayed and
bundled until they can be sent with the next event
message.  This  still  prevents  deadlock  and  reduces  the 
number  of  messages.  Deadlock  can  also  be  allowed  to 
occur  and  then  overcome  using  an  algorithm  which 
determines  the  cause  of  deadlock  and  how  to  recover  from 
it.  This  scheme  has  a  computation  time  overhead  which 
may  be  acceptable  depending  on  how  often  the  system 
deadlocks.  The  simulation  terminates  when  it  has 
processed  all  events  and  messages. 
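
The conservative rule and the role of null messages can be sketched as follows. This Python fragment models a single logical process and is an illustration only: the class, the link names, and the convention that a payload of `None` marks a null message are assumptions made for the example, and output links and deadlock recovery are not modeled.

```python
from collections import deque

class LogicalProcess:
    """Conservative LP with one FIFO queue per input link."""

    def __init__(self, input_links):
        self.queues = {link: deque() for link in input_links}
        self.local_time = 0.0
        self.processed = []

    def deliver(self, link, timestamp, payload=None):
        """Queue a message; payload None marks a null message."""
        self.queues[link].append((timestamp, payload))

    def step(self):
        """Simulate one message if it is safe; return False when blocked."""
        if any(not q for q in self.queues.values()):
            return False                     # some input link is still empty
        # Safe: every link has a message, so take the minimum timestamp.
        link = min(self.queues, key=lambda k: self.queues[k][0][0])
        timestamp, payload = self.queues[link].popleft()
        self.local_time = timestamp
        if payload is not None:              # null messages only advance time
            self.processed.append((timestamp, payload))
        return True

lp = LogicalProcess(["a", "b"])
lp.deliver("a", 3.0, "CMD1")
blocked_first = not lp.step()    # link "b" is empty, so the LP must wait
lp.deliver("b", 5.0)             # null message: nothing earlier than 5.0 on "b"
ran_second = lp.step()           # now the event stamped 3.0 is safe
```

The null message carries no event, but it tells the LP that nothing earlier than time 5.0 will arrive on link "b", which is enough to release the waiting event.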


IV.C. Checker Implementation

The first and most difficult consideration of writing
Checker in parallel is determining how to distribute the
circuit nodes among the processors. Load-balancing
requires an analysis of the circuit nodes, their associated
constraints and the event list. Either all of the circuit or
only the sections being checked can be partitioned and sent
to the processors.

Figure  2  shows  an  example  of  Checker  logic.  The  entire 
circuit  may  be  placed  on  one  processor,  or  the  top  and 
bottom  halves  on  two  separate  processors,  and  node 
"OUT"  on  a  third,  etc.  What  processor  a  circuit  node 
would  be  sent  to  depends  on  how  many  constraints  it  has, 
how long it takes to check those constraints, how many
times  it  is  evaluated,  and  what  other  nodes  it  affects  (to 
avoid  interprocessor  communication).  At  first  this 
information  would  be  determined  ahead  of  time,  and 
circuit  nodes  would  be  statically  allocated  to  a  processor. 
(The  current  flight  SEQGEN  checker  has  a  constraint 
network  that  does  not  change  from  run  to  run,  and  hence 
this  mapping  could  be  done  very  optimally,  with  a 
simulated  annealing  or  other  approach.)  Later  it  might  be 
possible  to  implement  dynamic  load-balancing  so  that  an 
idle  processor  could  take  on  another's  work.  However, 
this  might  incur  a  huge  overhead  in  information  transfer. 
Another  large  overhead  would  be  repeated  file 
manipulation.  In  fact,  earlier  versions  of  Time  Warp  did 
not  support  file  handling  because  rollbacks  require 
messages  to  be  buffered  and  written  sequentially  in  global 
time order. Checker writes each command to SEQGEN's
output files; it may also write status and error messages,
but  only  if  certain  conditions  are  met.  How  often  a 
processor  writes  to  a  file  greatly  affects  its  load.  Once 
the  synchronization  scheme  has  been  implemented,  we 
will  investigate  various  load-balancing  strategies  to 
optimize  the  simulation. 
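
As a simple illustration of static allocation, the sketch below assigns circuit nodes to processors using a greedy rule: the most expensive node goes to the processor that currently has the least load. This is not the simulated annealing approach mentioned above, and it ignores interprocessor communication; it is only a baseline under assumed inputs. The node names other than "OUT" and all cost estimates are hypothetical.

```python
import heapq

def allocate(node_costs, n_processors):
    """Greedy static allocation: heaviest node to least-loaded processor."""
    heap = [(0.0, p) for p in range(n_processors)]   # (load, processor)
    heapq.heapify(heap)
    placement = {}
    # Visit nodes from most to least expensive.
    for node, cost in sorted(node_costs.items(), key=lambda kv: -kv[1]):
        load, proc = heapq.heappop(heap)             # least-loaded processor
        placement[node] = proc
        heapq.heappush(heap, (load + cost, proc))
    return placement

# Hypothetical per-node checking costs (arbitrary units).
costs = {"A": 4.0, "B": 3.0, "C": 2.0, "OUT": 1.0}
placement = allocate(costs, 2)
```

A fuller treatment would add a penalty for placing strongly connected nodes on different processors, since the text identifies interprocessor communication as one of the factors in choosing a placement.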

Choosing  between  the  Time  Warp  and  Chandy-Misra 
algorithms  proved  to  be  difficult  since  the  constraint 
network is large and complex with many different types of
computations.  Knowing  exactly  how  the  system  behaves 
is  essential  in  determining  which  synchronization  scheme 
is  appropriate.  Figure  3  shows  the  structure  of  constraint 
checking  using  Time  Warp.  The  circuit  being  checked  is 
that  of  Figure  2,  where  all  circuit  nodes  are  completely 
distributed  among  the  four  processors.  (Note  the  time 
tags  on  the  commands.)  Individual  commands  to  be 
checked are assigned to one of the processors in this case,
and instances of changes in state of these network nodes
need to be transmitted across the hypercube processors, as
shown with the heavy arrows in Figure 3. Note that
although there is an even distribution of network nodes
across the hypercube processors, Node 3 remains idle since
no commands that it handles were in the input command
file. This is an inherent risk with this parallelization;
however, the other possibility (having the entire network
on  every  node  with  commands  randomly  assigned  to 
processors,  as  is  done  in  expander)  is  unwieldy  to 
implement  owing  to  the  high  interconnection  of  the 
network. 

At  first  it  would  seem  that  saving  all  states  would  take 
up  too  much  memory,  but  that  is  solved  by  removing  all 
states  before  the  GVT.  This  particular  example  shows  the 
bottleneck  that  can  occur  by  simulating  only  a  small 
sequence of commands. However, a sequence typically
contains 1000 to 5000 commands, so this bottleneck would
be less likely to occur as long as the assignment of
command types to processors were reasonably well
balanced.
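
The memory argument above can be illustrated with a minimal sketch. It assumes a hypothetical per-process state log; the names StateLog, save, rollback and fossil_collect are illustrative and are not taken from Time Warp itself. No rollback can reach back before GVT, so every saved state older than the newest one at or before GVT can be discarded. (Keeping that one state, rather than literally removing everything before GVT, is an assumption of this sketch so that a rollback to GVT remains possible.)

```python
from bisect import bisect_right


class StateLog:
    """Saved states of one simulated process, kept in timestamp order."""

    def __init__(self):
        self.times = []   # virtual times at which states were saved
        self.states = []  # state snapshots, parallel to self.times

    def save(self, time, state):
        # Assumes saves arrive in nondecreasing virtual-time order.
        self.times.append(time)
        self.states.append(state)

    def rollback(self, time):
        """Discard states saved after `time`; return the restored state."""
        keep = bisect_right(self.times, time)
        del self.times[keep:]
        del self.states[keep:]
        return self.states[-1] if self.states else None

    def fossil_collect(self, gvt):
        """Drop states older than GVT, keeping the newest one at or before it."""
        keep_from = max(bisect_right(self.times, gvt) - 1, 0)
        del self.times[:keep_from]
        del self.states[:keep_from]
```

In this sketch the log grows with every saved state and shrinks each time a new GVT estimate is applied, so its size is bounded by how far the process runs ahead of GVT rather than by the length of the command sequence.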


Although  Time  Warp  could  easily  support  spacecraft 
constraint  checking,  several  factors  make  it  unsuitable. 
Time  Warp  is  a  general  simulation  operating  system  that 
would  have  to  be  adapted  to  the  specific  requirements  and 
intricacies  of  constraint  checking.  It  produces  significant 
speedups only when rollbacks do not present too much
overhead. This is usually true for simulation of physical
systems  where  a  guess  in  behavior  does  not  cause  too 
many  rollbacks.  Physical  systems  involving  large 
computations  allow  most  messages  affecting  an  event  to 
arrive  before  an  incorrect  choice  can  be  made.  However, 
static  logic  circuit  verification  involves  little  computation 
per  node.  It  does  take  time  to  verify  complicated  circuits 
that may have thousands of nodes, which is what makes a
parallel  simulation  worthwhile.  If  constraint  checking 
added  a  large  amount  of  extra  computation  time,  the 
simulation  would  work  well  with  Time  Warp.  We  tested 
several  different  constraints  on  the  hypercube  and  they 
took very little time to process.

When simulating the same system as shown in Figures 2
and  3  for  Time  Warp,  if  there  are  only  a  few  commands, 
the system deadlocks very quickly unless null messages are
sent or a deadlock detection and recovery algorithm is used.
Otherwise the logical process FF12 has to wait until it
receives both C1 at time 0 and C2 at time 3. The
Chandy-Misra  method  has  already  been  adapted  for  logic 
circuit  simulation,  though  not  strictly  to  test  circuits,  but 
to  test  the  algorithm  itself.  Therefore,  the  major  drawback 
in  using  the  Chandy-Misra  method  is  that  no 
implementation exists on the JPL/Caltech Mark III
hypercube. However, if it were written from scratch,
the simulation could eventually be made especially
efficient for
spacecraft constraint checking. But if the circuit is so
interconnected  with  multiple  outputs  and  feedback  that  it 
reduces  the  inherent  parallelism  of  the  system,  then 
Chandy-Misra  would  essentially  lock-step  through  the 
simulation  and  Time  Warp  would  be  the  only  alternative. 
Evidence  indicates  that  this  is  not  the  case,  and  a 
simplified  subset  of  Chandy-Misra  will  be  implemented. 
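
The blocking behaviour described above, and the way a null message releases it, can be sketched as follows. This is a minimal illustration of the conservative (Chandy-Misra style) rule, not the prototype's code: the class and method names are hypothetical, the channel names C1 and C2 follow the example in the text, and the event payload is a placeholder. A process may only handle an event once every input channel has promised that nothing earlier can still arrive.

```python
from collections import deque


class LogicalProcess:
    """A conservative logical process with named input channels."""

    def __init__(self, channels):
        # Pending real events per channel, in arrival (timestamp) order.
        self.queues = {c: deque() for c in channels}
        # Last timestamp heard on each channel; -inf means nothing yet.
        self.clocks = {c: float("-inf") for c in channels}

    def receive(self, channel, time, event=None):
        # A message with event=None is a null message: it only promises
        # that nothing earlier than `time` will arrive on this channel.
        self.clocks[channel] = time
        if event is not None:
            self.queues[channel].append((time, event))

    def safe_events(self):
        """Return, in timestamp order, the events that can be processed now."""
        horizon = min(self.clocks.values())
        ready = []
        for queue in self.queues.values():
            while queue and queue[0][0] <= horizon:
                ready.append(queue.popleft())
        return sorted(ready, key=lambda pair: pair[0])
```

For example, after `receive("C1", 0, "cmd")` alone, `safe_events()` returns an empty list because C2 has said nothing; a null message `receive("C2", 3)` raises the horizon to 0 and releases the C1 event without C2 having to send a real command.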

The prototype will simulate a small subset of commands,
the same set being used as an example by expander,
allowing  the  integration  of  the  two  prototypes. 
Constraint  network  definitions  will  be  statically 
distributed  to  the  processors  and  each  command  from  the 
sequence  will  be  sent  to  the  processor  containing  its 
definition.  Then  the  simulation  will  proceed  according  to 
the  Chandy-Misra  algorithm.  Though  deadlock  avoidance 
using  null  messages  is  simpler  to  program  than  the 
deadlock detection and recovery algorithm, we will
investigate  which  scheme  is  more  efficient.  Later  more 
constraints  and  commands  will  be  added.  Previous  results 
with  parallel  logic  circuit  simulations  have  shown  that  the 
simulation  will  be  most  efficient  when  a  large  set  of 
commands  is  used  to  reduce  the  number  of  idle  processors 
[8].  The  prototype  implementation  of  Chandy-Misra  will 
serve  to  test  constraint  network  behavior,  since  it  will 
demonstrate  significant  speedups  only  when  a  large 
portion of the constraint network has been adapted for the
hypercube. 

How  efficient  the  parallel  implementation  of  Checker 
proves to be depends on the sequence and its
associated  models.  Checker  might  demonstrate  significant 
speedups  for  some  spacecraft  models  such  as  simple 
light-switch  relay  circuits,  but  may  even  slow  down  the 
checking  time  for  a  complicated,  highly-interconnected 
constraint  network  such  as  the  one  modeling  the 
restrictions  on  a  spacecraft  maneuver.  But  since  the 
spacecraft is composed largely of simple models, it would
be  possible  to  run  Checker  in  parallel  for  most  sequences 
and  sequentially  for  the  few  remaining  sequences  and  still 
work  efficiently. 


One of the major portions of flight SEQGEN checker is a
"little language" parser, which translates the LISP-like
constraint network specifications into the C code that runs
on  the  hypercube  nodes.  A  subset  of  this  parser  will  be 
needed  for  a  checker  prototype  that  can  perform  the 
constraint checks for Profile Activities that are being
used  as  tests  for  the  expander  module.  Building  this 
realistic  prototype  will  prove  that  the  checker  can  take  the 
results  of  the  expander  and  check  them  in  parallel. 

V.  Spacecraft  Simulation  in  Parallel 

An  existing  VAX  high-level  simulation  of  some  of  the 
functions  of  the  Galileo  spacecraft  onboard  computer  was 
ported  to  the  hypercube.  This  particular  simulation 
turned  out  to  run  very  quickly  on  one  node,  making 
parallelization  unnecessary.  The  reasons  why  this  was 
true  give  insight  into  design  requirements  for  future 
simulations.  The  Galileo  onboard  computer  is  itself  a 
parallel  computer  (although  of  the  master/slave  variety), 
and  the  VAX  simulation  took  this  parallel  computer  and 
simulated  it  sequentially.  A  large  portion  of  the  code  was 
bookkeeping  to  accomplish  this  simulation  of  a  parallel 
computer.  Therefore,  if  such  a  simulation  is  desired,  it 
should  not  be  ported  from  a  sequential  simulation  of  a 
parallel  system  but  should  be  written  in  parallel  in  the 
first  place.  Some  observations  along  these  lines  will  be 
discussed. Plans are in place to build a simulation of the
full  capabilities  of  the  Galileo  onboard  computers. 

VI.  Flight-qualifying  parallel  code 

An  attempt  was  made  to  understand  what  it  will  take  in 
the  way  of  new  software  testing  methods  and  design  tools 
to  certify  hypercube  code  (and  hypercube  operating 
systems)  for  flight-critical  systems  to  the  same  standards 
that  are  now  applied  to  sequential  codes.  Parallel  codes 
have  several  unique  characteristics  that  make  them 
challenging  to  debug  and  to  test  fully.  In  particular, 
although  any  one  module  may  be  tested  thoroughly, 
exactly  when  and  how  and  with  which  other  program 
running  on  which  other  node  it  will  interact  is  difficult  to 
predict. Quasi-parallel onboard computers, like the Galileo
and Magellan onboard computers, get around these
difficulties with architectures that are synchronous,
are not interrupt-driven, and are controlled by one master
processor (although which processor is the master can
change over time).

In  parallel  code,  three  major  classes  of  software  failures 
can be encountered: the hard failure (where the software
hangs), the soft failure (where the software continues to
process, but gives the wrong numerical result), and the
algorithmic failure (a "bug" of the same sort as one
encounters in sequential coding). Typically, one will
build  a  sequential  version  of  the  code  which  removes  most 
of  the  latter  bugs,  but  some  new  ones  inevitably  show  up 
as  interaction  phenomena  when  the  software  is  run  in 
parallel  for  the  first  time.  It  has  been  the  first  author's 
experience  that  most  interaction  bugs  show  up  on  two 
processors  and  can  be  removed  there,  and  that  virtually  all 
show  up  on  four  processors.  However,  it  has  occasionally 
been true that certain bugs show up only on eight- or
sixteen-processor or larger cubes, and these bugs are
frequently subtle and intermittent in their symptoms.

By  their  nature,  synchronous  codes  running  under  the 
Crystalline  Operating  System  (CrOS)  or  its  commercial 
variations  have  a  priori  predictable  communication 
patterns,  since  interrupt-driven  communications  are  not 
allowed  and  processors  are  required  to  handshake  in  a 
predictable  pattern,  or  deadlock  results.  Synchronous 
programs,  therefore,  are  more  prone  to  the  hard  failure 
during  the  development  phase  while  these  communication 
patterns  are  being  debugged.  Hard  failures  do  have  the 
virtue  that  they  are  usually  easier  to  spot  than  the  soft 
failure; however, they are more difficult to debug since
the  cube  will  simply  stop  in  most  cases  with  no 
diagnostic  output.  Sometimes  these  failures  will  not 
show  up  until  perverse  data  is  processed  by  the  code.  The 
general  synchronous  sorting  routine  described  above,  for 
example,  initially  failed  when  presented  with  unbalanced 
partial  lists  to  sort  on  different  nodes;  the  algorithm  had 
assumed  balanced  lists  to  sort  across  the  cube.  This  was 
corrected,  but  at  the  cost  of  more  complexity,  which  in 
turn  exposes  the  code  to  failures  of  the  other  two  types. 

Purely  asynchronous  code  is  probably  the  most  difficult 
for  the  design  of  software  qualification  tests  and  criteria 
for  its  acceptance,  since  the  possible  interactions  of  the 
software  with  itself  and  with  the  commands  it  is  checking 
are  so  numerous.  This  will  particularly  be  a  problem  for 
checking  algorithms,  which  tend  to  naturally  be  somewhat 
asynchronous  and  hard  to  design  even  sequentially  (since 
more sophisticated ones verge on expert systems).

We  will  need  to  design  cases  that  test  communication 
paradigms, and not just output. This is analogous to the
process in expert system debugging, where both the
reasoning that produced a certain result and the
result itself need to be correct before the system can
be  accepted  for  critical  applications  [11].  For  example, 
the  commonly-used  flight  software  testing  technique  of 
regression  testing  would  need  to  be  expanded  to  test 
communications  as  well.  In  a  sequential  code,  testing 
four  sets  of  the  same  command  to  be  expanded  might  not 
be  interesting,  but  might  be  valuable  in  parallel  to  test 
whether all processors take care of the expansion the same
way.

VII.  Open  issues  and  conclusions 

Many  open  issues  remain  in  the  use  of  hypercubes  for 
critical  operations.  Issues  of  reliability  and  predictability 
need to be understood more thoroughly. We feel
that parallel computers in general, and hypercubes in
particular, have a role in spacecraft commanding and other
critical applications, but many challenges remain to make
these codes both powerful and reliable. Although this study is
geared  to  the  particular  problem  of  generating  commands 
for,  and  predicting  the  actions  of,  a  deep-space  probe,  the 
conclusions  are  applicable  to  efforts  to  generate  and  verify 
code for any critical computer, remote or not.

1 Introduction

This paper discusses a novel approach to solving
properly asynchronous heterogeneous problems on
a hypercube architecture such as the NCUBE/10
computer. The discussion is divided into the
classification of possible problems on the hypercube, a
description of blackboards and their utility in
solving properly asynchronous heterogeneous
problems on the NCUBE, and the application of these
techniques to a specific example: structure elucidation
of organic compounds from spectroscopic data.

The NCUBE/10 [13, 15] is a MIMD computer
consisting of 1024 32-bit processors and a
coarse-grained homogeneous distributed memory. The
types of problems can be divided and described
in terms of how the nodes communicate and how
the problem is subdivided among the individual
nodes [6]. Communication can be synchronous,
loosely synchronous or asynchronous. In
synchronous communication, all processors are doing
the same thing at the same time and, as a result, all
communication is automatically synchronized. In
properly loosely synchronous problems, each
processor is doing its own thing but must synchronize
with all other processors whenever doing
interprocessor communication. In properly asynchronous
problems, each processor acts independently,
communicating whenever necessary without regard for
synchronizing with the other processors. In a
sequential asynchronous problem, there is a lock-step
order which allows no parallelization of the
problem, whereas concurrent asynchronous
problems can be processed in parallel. Amdahl’s Law
states that the maximum speedup for any problem
is always bounded by 1/s, where s is the serial
fraction of the work (that portion which cannot
be parallelized). Asynchronous problems are the
most difficult problems to execute efficiently in a
MIMD environment and therefore offer the
greatest challenge to the programmer. According to
Fox, out of 84 programs reviewed only 8 properly
asynchronous problems were identified. Properly
asynchronous problems contain the greatest
uncertainty for speedup on a machine like the NCUBE.
No matter what type of problem, it is essential to
minimize communication and, when it is necessary,
to restrict it to adjacent processing nodes, with
relatively large program segments working largely
independently if at all possible.
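
As a worked illustration of the bound quoted above, the following sketch evaluates the usual finite-processor form of Amdahl's Law, 1/(s + (1 - s)/p), which approaches 1/s as the number of processors p grows. The function name and the 5% serial fraction in the usage note are illustrative assumptions; the 1024-processor count is the NCUBE/10 figure given above.

```python
def amdahl_speedup(serial_fraction, processors):
    """Speedup predicted by Amdahl's Law.

    serial_fraction: fraction of the work that cannot be parallelized (s).
    processors: number of processors sharing the parallel part (p).
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)
```

With s = 0.05, `amdahl_speedup(0.05, 1024)` comes to about 19.6, just under the 1/s = 20 limit, so even a small serial portion leaves most of a 1024-node machine underused.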

Synchronous communication implies that all
processors are executing identical code on different
segments of the data. Asynchronous
communication may also involve identical code but, because
of differences in the complexity of the data, the
processors may not complete their assignments at
the same time. On the other hand, it is feasible
that different processors may actually be
executing different code simultaneously. Fox has “ruled
out coarse grained functional decomposition, e.g.,
different sub-routines of a given application
running on different nodes, because this is only
capable of obtaining modest speedup as essentially all
real applications only have a few distinct actions
to be performed concurrently.” We are interested in
just how much speedup is possible in a problem
which involves asynchronous communication
between different segments of code executing
simultaneously.

2  Blackboard  Systems 

Our approach for solving heterogeneous problems
with properly asynchronous communication on a
parallel computer is to use a blackboard model
for control of the different code executing
simultaneously and for communication of the partial
solutions between the different code segments. A
blackboard [22] is a problem-solving model that
allows opportunistic reasoning. The blackboard
model consists of three major components:
control, the blackboard data structure and separate
knowledge sources.

Control dynamically selects which knowledge
sources to execute and which data to
manipulate at any one time, thereby coordinating the
manipulation of the blackboard and determining
the focus of attention. This is especially useful
for complex ill-structured problems with poorly
defined goals and an absence of a predetermined
decision path, which is a good description of how
a heterogeneous asynchronous problem must be
solved in a parallel environment. Because control
is dynamic, it can utilize opportunistic reasoning
techniques and avoid lock-step control sequences
which would make execution inefficient. Alternate
solutions can develop simultaneously and
heuristic methods can be used to short-circuit
computation. Forward and backward reasoning can be
used simultaneously. Control maintains criteria
for determining when to terminate execution and
is capable of handling errors gracefully.

The blackboard data structure is a global
database. The individual knowledge sources
communicate and interact via the blackboard. The
solution space includes all possible partial and full
solutions. It can be organized into one or more
application-dependent hierarchies. Each level of
the hierarchy can contain its own unique
vocabulary. Ultimately these vocabularies will coalesce
as the solutions develop and progress up the
hierarchy. Reasoning which supports the partial
solutions can come from below and/or above in the
hierarchy. The blackboard handles all
message-passing constraints and allows communication
between disparate sources of information regardless
of the different vocabularies involved. Ultimately
this should minimize communication between
specialists so that the individual processors can
concentrate on computation.

In order to limit interaction between knowledge
sources, the problem is decomposed into loosely
coupled subproblems. The individual knowledge
sources can be implemented as rules, objects or
procedures. Each knowledge source knows what it
is capable of contributing to the solution. Details
of the task dictate the type of knowledge
representation and the reasoning methods employed. The
interaction is organized hierarchically, with
integration of diverse concepts and vocabulary. Each
knowledge source can be modified without
affecting the other sources, making it relatively easy to
update the knowledge base. Since each knowledge
source works relatively independently, no one piece
of data becomes a barrier to the solution, but
additional information will improve performance. Any
uncertainty is handled with credibility weights.
Conflicting data can either be eliminated, if the
difference in certainty is large, or both partial
solutions can be saved independently on the
blackboard for further processing.

Blackboards are especially effective at handling
problems with a large solution space that depend on
noisy and unreliable data. A variety of input data
can be handled. Multiple reasoning methods can
be used simultaneously and solutions can develop
in stages. The blackboard is a potentially effective
method for finding and expressing parallelism in a
heterogeneous asynchronous problem. The
knowledge sources can be divided to provide optimal
grain size of data and knowledge. The knowledge
sources know what they can do and can be
responsible for local control while Control is responsible
for global control.
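
The three components described in this section can be sketched in a few lines. This is a sequential, single-process illustration under stated assumptions, not the NCUBE implementation: the class names, the `can_contribute`/`contribute` callables and the first-ready selection policy are all hypothetical choices made for the example. Each knowledge source decides locally whether it can contribute, while the control loop decides globally which ready source runs next and when to stop.

```python
class Blackboard:
    """Global data structure shared by all knowledge sources."""

    def __init__(self):
        self.entries = []  # (source name, partial solution) pairs

    def post(self, source, item):
        self.entries.append((source, item))


class KnowledgeSource:
    """A specialist that knows when it can contribute and what it adds."""

    def __init__(self, name, can_contribute, contribute):
        self.name = name
        self.can_contribute = can_contribute  # blackboard -> bool
        self.contribute = contribute          # blackboard -> new item


def run_control(blackboard, sources, is_done, max_cycles=100):
    """Opportunistic control loop; returns True if a solution was reached."""
    for _ in range(max_cycles):
        if is_done(blackboard):
            return True
        ready = [s for s in sources if s.can_contribute(blackboard)]
        if not ready:
            return False  # no source can make further progress
        chosen = ready[0]  # simplest possible focus-of-attention policy
        blackboard.post(chosen.name, chosen.contribute(blackboard))
    return is_done(blackboard)
```

Because the order of execution is driven by what is currently on the blackboard rather than by a fixed call sequence, the order in which the sources are listed does not determine the order in which they run. In a parallel setting the `ready` sources would be candidates for concurrent execution on different nodes; this sketch runs them one at a time.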

3 The Structure Analysis Problem

We have chosen a problem which should be an
effective test of using a blackboard to maximize
concurrency for a heterogeneous asynchronous
problem. Organic chemists are always concerned with
determining the chemical structure of unknown
organic compounds. In the past this was done by
using a battery of chemical tests, but these tests
are messy, time consuming and destroy the sample
in the process. Modern organic structure
elucidation depends heavily on absorption spectroscopy.
In absorption spectroscopy, a specific frequency of
light is passed through the organic compound to
determine whether the light is absorbed or not.
Measuring the absorption gives a variety of
information about the compound. The advantages of
these techniques are that they are quick, relatively
easy and do not destroy the sample. The
disadvantage is that the resulting data must be
analyzed by an “expert”. We intend to develop an
expert system which can run effectively on the
hypercube. Some of the characteristics of this
problem are as follows. There are several
different types of spectroscopy which look very
different and give very different information. In other
words, they involve very different expertise and
very different “vocabularies” which must
somehow be integrated to generate the structure. Each
spectrum may include many different absorptions.
Much of the information in the spectra is
uncertain and ambiguous. Therefore a human expert
will process the most important and most reliable
information in each spectrum first. If a structure
can be determined, no further processing is
necessary. On the other hand, if the solution is
incomplete, further processing of the data can be done.
Some of the ambiguity in the data is alleviated by
the fact that data from different spectra can
reinforce each other. If data from different spectra
conflict, the data with the greater reliability are
used first. If this fails, the data can be
reevaluated. A chemist may not have all the data that
would be useful. Each facility will have different
equipment with limits on its capabilities.
Therefore a human expert must be flexible in what data
are used and in what order to solve the problem.

To handle these problems, the computer expert
must be modular so that individual experts can
work independently on the different spectra. This
also makes it easier to update and add new experts
without affecting the performance of existing
experts. Computer programs are often designed in
a very sequential manner where each set of data
is exhausted before moving on to other sets. As
already indicated, this is counterproductive with
spectra. The blackboard approach will allow a
much more flexible manipulation of the spectra,
allowing for opportunistic reasoning similar to a
human expert’s approach. Since the blackboard
allows several different solution paths to develop
simultaneously, no one piece of spectral data will
inhibit the progress of the program.

4  The  Knowledge  Sources 

The following is a description of some of the
experts involved in the elucidation of chemical
structure. The knowledge sources can be divided into
three groups of experts: the structure generation
routines, spectral experts, and chemical experts.

Structures are generated in stages [8]. The
different stages can be executing simultaneously
since more than one structure is often possible.
These routines are modeled after the design of
CHEMICS. The following is a list of primary
components, which are the basic components for
constructing organic molecules: CH3, CH2, CH, C,
CO, OH, O, NH2, NH, N, SH, S, F, Cl, Br,
and I. Secondary components are combinations
of primary components that are useful in constructing
organic molecules and that can be related to spectral
data. There are 86 secondary components.
Tertiary components consist of a secondary
component combined with an “afferent nature”, which is
simply a restriction on what the secondary
component can bond to. There are 630 such
combinations. Initially, maximum and minimum
values for primary and secondary components are
determined from the molecular formula. All
possible sets of primary components are determined.
Based on these, the possible sets of secondary
components are generated. The secondary
components are reviewed for consistency with the
chemical formula and spectral data. Finally, tertiary
component sets are generated. The sets of tertiary
components are used to generate complete
structures, including the bonding of the components.
A variety of steps are performed, including
generation of linkage, tests for absolute linkage and
absolute nonlinkage, checking of separated
structures and checking of generated structures.

Common spectroscopy techniques include
infrared, ultraviolet, proton NMR, carbon-13 NMR,
and mass spectroscopy. The characteristics of an
infrared spectrum include the frequency in reciprocal
centimeters, the intensity (strong, medium, weak)
and the shape (broad) of the peaks. The infrared is
particularly important in determining the
presence or absence of specific functional groups such
as carbonyl (C=O), hydroxyl (OH), amino (NH),
nitrile (CN), and carbon-carbon double and triple
bonds.

Mass spectroscopy gives entirely different
results. It is a bar graph where one
coordinate corresponds to the mass-to-charge ratio and
the second relates to the abundance of the ion.
A mass spectrum also contains a large number of
peaks, many of which are often ignored
in determining chemical structure. The mass of
the molecular ion is the molecular weight of the
molecule, which makes it particularly important.

Ultraviolet spectroscopy is much simpler but
also much less useful than either infrared or mass
spectroscopy. An absorption in the ultraviolet
indicates conjugation (alternating double and single
bonds). Usually there are only a few absorptions
or possibly none. Analysis depends on the
frequency and the intensity of the peak.

Proton NMR (nuclear magnetic resonance)
spectroscopy can be analyzed based on the
chemical shift in parts per million or ppm (chemical
environment of the different hydrogens in the
molecule), integration (ratio of different hydrogens
in the molecule), and splitting pattern (number of
adjacent hydrogens). It contains a large amount of
useful information and all absorptions are
significant in structure determination.

Carbon-13 NMR is a source of information
about the carbons in the molecule just as
proton NMR is a source of information about the
hydrogens in the molecule. Chemical shift and
splitting are available but integration is not. The
range of chemical shift is over 200 ppm (unlike
proton NMR where overlap often occurs). As a
result there is very little chance for overlap
between chemically different carbons.

Other spectroscopy techniques are continually
being developed and refined. Our system will
allow easy modification of and addition to the experts
without disrupting the system. Each spectroscopy
expert must handle very different data as
indicated above, but the analysis of each results in
molecular fragments. Therefore the blackboard
will allow the individual experts to use their own
unique vocabulary and ultimately convert it into a
universal vocabulary of molecular fragments which
are then used to direct and restrict the
structure generation routines. As the experts
generate components, these can be used to prune
primary, secondary and tertiary components, thereby
preventing combinatorial explosion. If fragments
from different experts conflict, the fragments with
the highest certainty factor will be added to the
active fragment list and the other fragments will
be retained on an inactive list. The inactive list
will be used only if the active list fails to
generate one or more structures which adequately explain
the data. Control will use the spectroscopy experts
not only to generate possible fragments but also
to test generated structures to see that they are
consistent with the spectral data. Therefore the
spectroscopy expert will be used both at the front
end and at the back end and will be an integral
part in determining when to terminate execution.
Like any good expert, the system will also be able
to determine when there is insufficient information
to identify one structure as the correct one.
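
The active/inactive rule for conflicting fragments can be sketched as below. The tuple layout, the notion of a `slot` (a key naming the part of the molecule a fragment is meant to explain) and the numeric certainty values are hypothetical conventions introduced for this illustration; the text above specifies only that the fragment with the highest certainty factor goes on the active list and the others are retained on an inactive list. Reinforcement between experts that agree is not modeled here.

```python
def resolve_fragments(proposals):
    """Split fragment proposals into an active and an inactive list.

    Each proposal is a tuple (expert, slot, fragment, certainty). Proposals
    that share a slot are treated as competing; the one with the highest
    certainty becomes active and the rest are kept as inactive fallbacks.
    """
    best = {}
    for proposal in proposals:
        slot, certainty = proposal[1], proposal[3]
        if slot not in best or certainty > best[slot][3]:
            best[slot] = proposal
    active = [p for p in proposals if best[p[1]] is p]
    inactive = [p for p in proposals if best[p[1]] is not p]
    return active, inactive
```

Nothing is discarded: the inactive list keeps every losing proposal so that structure generation can be retried with it if the active fragments fail to explain the data.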

An example of the solution of a spectroscopy
problem is as follows. (This problem is included to
indicate the variety of information which must be
integrated in a structure elucidation problem. The
actual details would probably be comprehensible
and interesting only to another chemist.) The
molecular formula for the unknown is C7H12O4.
The primary structure generator determined that
there were 93 possible combinations or sets of
primary structures. Using these as the starting point,
497 sets of secondary structures were produced
as possible candidates using the secondary
structure generator. All possible combinations of these
would then be tested in determining the tertiary
structures and the complete structural formulas
that were possible. Alternatively the spectroscopy
experts could be used to prune the primary sets,
drastically reducing the number of possibilities.

The following information was obtained from the
spectra. The infrared spectrum for this unknown
contained at least 15 distinct peaks (or absorptions),
but two were particularly important: a broad
absorption at 3000 cm-1 and a strong peak at
1725 cm-1. These were indicative of the -OH and
C=O of a carboxylic acid group. The mass spectrum
contained approximately 30 peaks, with the parent
or molecular ion at 160 m/e. This could be used to
determine the molecular formula, which was
C7H12O4. The ultraviolet spectrum contained no
significant absorption, which indicated that the
molecule lacked conjugation (alternating
carbon-carbon double and single bonds). The
carbon-13 NMR contained 7 absorptions. The first
two came at about 180 ppm offset from TMS, which
suggests two different carbonyl carbons (C=O). The
next peak at 70 ppm was due to the solvent used
(dioxan). The last four peaks were, in order, a
singlet (C), triplet (CH2), triplet (CH2) and
quartet (CH3) in the off-resonance decoupled
spectrum. Finally, the proton NMR contained a peak
for HOD (indicating an exchangeable proton in the
molecule), a singlet for the solvent (dioxan), two
multiplets integrating for two protons each
(CH2-CH2) and a singlet integrating for six protons
(two methyl groups).

Using all this information to prune the sets of
primary structures leaves only one possibility, the
set containing the following primary components:
2 methyls (CH3), 2 carbonyl groups, 2 hydroxyl
groups (combining these gives two carboxylic
acid groups, which are examples of secondary
structures), 1 carbon (C), and 2 methylenes
(CH2). When combined, the only feasible structure
is (CH3)2C(CO2H)CH2CH2CO2H. This
dramatically reduces the number of possibilities
that must be explored.
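
As a consistency check on the example, the surviving component set can be summed and compared with the molecular formula and the parent ion. The sketch below is illustrative only: it takes two methyl groups (as the six-proton singlet and the final structure indicate), uses nominal integer atomic masses, and the component table and function names are assumptions rather than part of the described system.

```python
from collections import Counter

# Nominal (integer) atomic masses, sufficient for a parent-ion check.
ATOMIC_MASS = {"C": 12, "H": 1, "O": 16}

# Atom content of each primary component named in the example.
COMPONENTS = {
    "methyl": Counter(C=1, H=3),
    "methylene": Counter(C=1, H=2),
    "carbon": Counter(C=1),
    "carbonyl": Counter(C=1, O=1),
    "hydroxyl": Counter(O=1, H=1),
}


def total_atoms(component_counts):
    """Sum the atoms contributed by a set of primary components."""
    total = Counter()
    for name, count in component_counts.items():
        for element, n in COMPONENTS[name].items():
            total[element] += n * count
    return total


def nominal_mass(atoms):
    """Nominal molecular mass for a Counter of element counts."""
    return sum(ATOMIC_MASS[element] * n for element, n in atoms.items())


candidate = {"methyl": 2, "carbonyl": 2, "hydroxyl": 2,
             "carbon": 1, "methylene": 2}
atoms = total_atoms(candidate)
print(dict(atoms), nominal_mass(atoms))
```

A pruning step of this kind rejects any component set whose atom totals disagree with the formula before the more expensive secondary and tertiary generators are run.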

In  some  problems,  the  available  information  will 
be  less  restrictive.  As  a  result,  the  structure  gen¬ 
erator  will  have  much  more  work  to  do.  In  these 
cases,  it  is  feasible  for  Control  to  divide  the  work  of 
the  structure  generator  between  several  processing 
nodes.  The  following  example  demonstrates  this: 

The molecular formula for the unknown is
C12H20O2. The primary structure generator produced
hundreds of primary components that were
consistent with this formula. Without any restrictions
generated by the spectroscopy experts, over
100,000 sets of secondary components were constructed.
The spectral data did give the following
information: In the infrared, an absorption at
1750 cm-1 indicated that the molecule contained a
carbonyl (C=O). The mass spectrum had a molecular
ion of 196, consistent with the above molecular
formula. There was no absorption in the ultraviolet
spectrum, indicating a lack of conjugation. In
the proton NMR, there were four vinyl protons,
two of which were split into a doublet (indicating
a hydrogen on the adjacent carbon atom). The
rest of the protons all appeared at roughly the
same location, allowing for little additional
information. Therefore the major restriction put on
the structure generator was that the possible sets
of primary components must contain at least one
carbonyl.

Utilizing  a  blackboard  based  architecture  for  ex¬ 
pert  systems  allows  the  parallelization  of  hetero¬ 
geneous  asynchronous  problems.  A  set  of  nodes  is 
dedicated  to  updating  the  blackboard,  overseeing 
communication  between  the  nodes  and  the  black¬ 
board  and  controlling  what  the  individual  nodes 
are  executing  at  any  one  moment.  The  program 
is  implemented  on  a  cube  of  size  2n,  which  is  sub¬ 
divided  into  two  cubes  of  size  n.  The  first  is  the 
blackboard  cube  or  bcube  which,  as  the  control 
unit,  is  responsible  for  maintaining  and  updat¬ 
ing  the  blackboard,  controlling  how  the  problem 
is  subdivided,  determining  the  focus  of  attention 
in  each  module,  handling  any  dynamic  load  bal¬ 
ancing  as  the  solution  progresses  and  acting  as 
the  communication  link  between  the  different  het¬ 
erogeneous  segments  of  the  problem.  The  second 
cube  of  size  n  consists  of  the  processing  nodes  each 
of  which  is  directly  connected  to  one  of  the  nodes 
in the bcube. The processing nodes act collectively
as the knowledge sources. All communication
is transparent to the processing nodes except
for  their  contact  with  the  blackboard  node  directly 
connected  to  them. 

In  the  following  table  we  summarize  the  per¬ 
formance  of  the  system  for  the  molecular  formula 
C7H12O4.  In  this  case  there  were  93  primary  com¬ 
ponent  vectors  and  497  secondary  vectors.  The 
data  in  the  table  does  not  take  into  consideration 
communication  time  of  results  back  to  the  host. 

Expert Processors    Time    Speedup
        1             450
        4             115      3.9
        8              60      7.5
       16              38     11.0
       32              26     17.3
       64              25     18.0

Figure 1: Times for C7H12O4
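
The speedup column can be recomputed from the listed times, and dividing by the processor count gives the parallel efficiency. The sketch below is only a reading aid: the time units are not stated in the table, and the efficiency column is an addition for illustration. Recomputing from the listed times gives about 11.8 for 16 processors, where the table lists 11.0; the other rows agree with the table.

```python
# Times from Figure 1, keyed by number of expert processors.
times = {1: 450, 4: 115, 8: 60, 16: 38, 32: 26, 64: 25}


def speedup_table(times):
    """Return rows of (processors, time, speedup, efficiency)."""
    base = times[1]
    rows = []
    for p in sorted(times):
        speedup = base / times[p]
        rows.append((p, times[p], round(speedup, 1), round(speedup / p, 2)))
    return rows


rows = speedup_table(times)
for p, t, s, e in rows:
    print(f"{p:>3} {t:>4} {s:>5} {e:>5}")
```

The efficiency falls from near 1.0 at 4 processors to under 0.3 at 64, which is consistent with the nearly flat times for 32 and 64 processors.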


5  Conclusions 

In  summary,  successful  structure  elucidation  re¬ 
quires  the  combination  of  several  different  ex¬ 
pertises  such  as  the  chemical  knowledge  of  how 
molecules  are  constructed  from  simple  compo¬ 
nents  and  how  this  process  can  be  restricted  based 
on  different  spectroscopic  data.  The  different 
structure generation routines can run simultaneously,
since many different paths must be pursued.
To prevent combinatorial explosion, the
spectroscopy  experts  can  simultaneously  deter¬ 
mine  restrictions  on  the  structure  generation  rou¬ 
tines  and  check  the  resulting  partial  and  full  solu¬ 
tions.  Control  can  dynamically  reallocate  proces¬ 
sors  to  different  subproblems.  Parts  of  the  struc¬ 
ture  generation  routines  can  be  further  subdivided 
into  homogeneous  subproblems.  Although  a  prob¬ 
lem  such  as  this  will  never  attain  the  efficiency 
of  a  simultaneous  homogeneous  problem  where  all 
processors  are  executing  the  same  code  on  differ¬ 
ent  data,  our  goal  is  not  to  compete  with  such 
problems  but  to  show  that  parallel  processing  on 
a  hypercube  can  dramatically  speed  up  the  execu¬ 
tion  of  problems  which  were  originally  considered 
inappropriate  for  such  an  architecture. 
Abstract

Production  system  implementations  of  expert  sys¬ 
tems  are  becoming  more  prevalent  in  a  large  number 
of  diverse  specialties,  but  the  relatively  slow  execution 
speed  of  these  systems  precludes  their  use  in  most  real¬ 
time  applications.  The  feasibility  of  improving  the  per¬ 
formance  of  production  systems  software  using  parallel 
computer  architectures  is  an  area  of  significant  interest 
in  artificial  intelligence  research.  One  of  the  parallel 
computer  architectures  of  interest  is  message-passing 
multicomputers.  Production  parallelism  attempts  to 
apply  parallelism  at  a  very  high  level  in  order  to  es¬ 
tablish  problem  granularities  that  are  compatible  with 
current  generation  multicomputer  architectures.  Even 
when  using  larger  grain  parallelism,  high  performance 
production  system  shells  represent  a  problem  for  some 
multicomputer  architectures.  The  problems  related  to 
implementing  a  production  system  shell  on  a  message¬ 
passing  multicomputer  and  mapping  a  specific  applica¬ 
tion  to  this  implementation  are  analyzed.  Further  re¬ 
search  into  the  problems  of  parallelizing  real-time  pro¬ 
duction  systems  applications  is  also  discussed. 

Introduction 

Intelligent real-time control and real-time monitoring
tasks are among the most challenging new environments
for computer applications. Efforts to apply
traditional software implementation methods to some
of these areas have typically met with failure due to the
unmanageability of the associated computer code [15,
77]. Researchers are discovering that complex tasks
performed by trained humans (like piloting an aircraft)
lend themselves to solution using artificial intelligence
(AI) approaches. One of the most successful areas of
AI research is expert systems, computer
programs capable of emulating the problem solving
capabilities of a human expert in a specific field of
knowledge [4, 5]. The most significant problem facing the use
of expert systems for real-time tasks is their slow
execution speed. A real-time requirement means “there
is a strict limit by which the system must have produced
a response to environmental stimuli regardless
of the algorithm it employs” [14, 10]. A number of expert
systems proposed and developed for real-time use
are not capable of sustaining this level of performance
[10, 27].

The feasibility of improving the performance of real-time
expert systems (particularly those based on the
production system paradigm) using parallel computer
architectures is an area of current interest in AI research.
The performance requirements of real-time
production systems motivate the researcher to eliminate
common expert system inefficiencies. These inefficiencies
include the overhead associated with symbolic
languages (e.g. Lisp, Prolog) and the disproportionately
large amount of time spent in the pattern
matching phase of a production system’s match-select-act
cycle. Despite the use of more efficient languages
(such as C) and match algorithms (such as Rete), the
problem of match phase efficiency persists [5,
14]. This problem therefore becomes the focus of efforts
to increase the execution speed of expert systems through
the use of parallel computer architectures.

Background 

Robotic  Air  Vehicle  Application 
One sponsor of research in this area, the Air Force
Wright Aeronautical Laboratories (AFWAL), requires
a fast multiprocessor architecture to process an expert
system capable of piloting a robotic air vehicle (RAV)
[12, 1326]. Under contract with Texas Instruments
(TI), the RAV was developed as a production system
based expert system using the Automated Reasoning
Tool (ART) implemented on several TI Explorer Lisp
machines. Figure 1 shows the overall structure of the
current RAV implementation including the piloting expert
system and airspace expert system. Both the
piloting and airspace expert systems applications are
based on the production system paradigm. Unfortunately,
the RAV expert system has proven to be too
compute-intensive to yield results in real-time on any
serial or parallel computer architecture (hardware and
software) developed to date [2].

Figure 1: RAV System Architecture [12, 1327]

Production  Systems 

The  production  system  paradigm  is  one  of  the  most 
common  methods  for  implementing  expert  systems  ap¬ 
plications.  Production  systems  apply  pattern  directed 
search  (inference)  using  rules  which  are  based  on  a 
subset  of  first  order  predicate  logic.  The  execution  of 
these  rules  modifies  a  set  of  facts  which  describe  the 
current  state  of  a  specific  problem.  The  following  ex¬ 
panded  definition  describes  a  production  system  by  its 
basic  components  [11,  185]: 

• a set of facts, collectively known as working memory
(WM), that describe the current state of a
problem. A common way of expressing facts is
as an object-attribute-value triple [5, 8] such as:

(autopilot-switch  position  off) 

• a set of rules, collectively known as the rule base.
Each rule represents an element of problem solving
knowledge for a specific application. A rule
typically contains one or more condition elements
(CEs) (i.e. logical predicates) that must be satisfied
in order for the rule to be applied to solving
the problem. A CE can be thought of as an object
with instantiated values, capable of being
matched with facts in WM. Rules have the general
form:

if  (these  facts  exist)  then  (execute  this  action) 

• a control structure known as a production cycle
that performs the generalized search of the problem
state-space:

1. Match - evaluate the “if” part of each rule
to determine whether it is consistent with the
current contents of WM. This is equivalent
to producing all possible extensions of the
current state of a problem.

2. Select - choose one rule from those passing
the match test; if no rules are eligible, terminate
execution. This is essentially the same
as choosing the most promising extension of
the current state (path).

3. Act - perform the actions specified in the
“then” part of the chosen rule and return to the
match phase unless an explicit stop condition
exists. This phase updates the current state
of the problem (WM), forming a new state.
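
The three components above can be illustrated with a deliberately naive interpreter. This is a sketch under simplifying assumptions, not the behavior of ART, OPS5 or CLIPS: facts are plain object-attribute-value tuples without variables, rules only add facts, the select phase simply takes the first eligible rule in rule-base order, and a rule is not re-fired once its additions are already in WM. The rule and fact names are hypothetical.

```python
def run_production_system(rules, working_memory, max_cycles=100):
    """Run a naive match-select-act loop; return (final WM, fired rule names)."""
    wm = set(working_memory)
    fired = []
    for _ in range(max_cycles):
        # Match: a rule is eligible if all its condition facts are in WM
        # and firing it would still change WM (a crude refraction test).
        conflict_set = [r for r in rules
                        if r["if"] <= wm and not r["then"] <= wm]
        if not conflict_set:
            break  # no eligible rules: terminate execution
        # Select: first eligible rule (assumed strategy for this sketch).
        rule = conflict_set[0]
        # Act: update WM, forming a new problem state.
        wm |= rule["then"]
        fired.append(rule["name"])
    return wm, fired


rules = [
    {"name": "engage-autopilot",
     "if": {("autopilot-switch", "position", "on")},
     "then": {("autopilot", "state", "engaged")}},
    {"name": "hold-altitude",
     "if": {("autopilot", "state", "engaged")},
     "then": {("altitude-hold", "state", "active")}},
]
initial_wm = {("autopilot-switch", "position", "on")}
final_wm, fired = run_production_system(rules, initial_wm)
print(fired)
```

Note that the match phase here re-evaluates every rule against the whole of WM on every cycle, which is exactly the cost that state-saving match algorithms are designed to avoid.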

The  Rete  Match  Algorithm 
The  match  phase  of  the  production  cycle  accounts  for 
approximately  90%  of  the  total  execution  time  in  most 
production  systems  [6,  70]  and  therefore  becomes  the 
focus for increasing production system performance
through the use of specialized algorithms. Rete is an
efficient  match  algorithm  which  exploits  two  preva¬ 
lent  characteristics  of  most  production  system  applica¬ 
tions  to  obtain  significant  increases  in  execution  speed: 
First,  in  a  typical  production  system  application,  only 
a  small  fraction  of  the  WM  changes  during  each  pro¬ 
duction  cycle.  Rete  takes  advantage  of  these  small 
WM  changes  by  storing  CEs  satisfied  during  previous 
production  cycles  and  using  them  to  aid  in  satisfy¬ 
ing  rules  during  subsequent  production  cycles.  This 
“state  saving”  feature  means  that  only  recent  changes 
to  WM  need  to  be  evaluated  during  the  current  match 
phase  instead  of  evaluating  the  entire  WM.  Second, 
Rete  exploits  the  commonality  of  CEs  in  the  “if”  side 
of  the  rules  in  the  rule  base.  By  eliminating  redun¬ 
dant  CEs,  the  Rete  algorithm  evaluates  a  given  CE 
only  once  each  cycle  even  though  this  CE  may  exist  in 
the “if” side of a number of rules. Eliminating redundant
CEs tends to significantly reduce the number of
pattern  matching  tests  that  need  to  be  accomplished 
during  each  production  cycle  [3,  35]. 

The Rete algorithm uses a special type of constraint
network compiled from the “if” side of rules in the rule
base to perform efficient pattern matching. Figure 2
shows an example of a simple Rete network compiled
from a single rule. The Rete network is generated at
compile time, before the production system application
is actually executed. At run-time, entities (i.e.
tokens) containing a flag, a list of time tags and a list of
variable bindings flow through the Rete network during
each match phase. Each token flows through the
network only until a match for its list of variable bindings
is not possible. A token reaching the bottom of
the graph contains a complete and valid list of variable
bindings, making it eligible for execution [3, 38].

Figure 2: A Single Rule Rete Network

The  Rete  network  contains  three  basic  types  of 
nodes [7, 688]:

• Constant Test Nodes - test for consistency between
attribute values of individual CEs, such
as binding the aircraft object’s location attribute
value  to  “a”  in  figure  2.  These  nodes  appear  in 
the  top  part  of  the  network  and  take  less  than 
10%  of  the  total  Rete  network  update  time. 

• Memory Nodes - store the results of previous
match  phases  for  use  in  the  current  match  phase. 
The  state  stored  by  these  nodes  consists  of  a  list 
of  all  previous  tokens  that  match  the  CEs  bound 
up  to  this  point  in  the  network.  This  means  that 
only  changes  made  to  WM  by  the  most  recent  rule 
firing  need  to  be  considered  during  the  current 
cycle. 

• Two Input Nodes - check consistency of variable
binding between two different token inputs.
A new token is propagated if and only if there is
complete consistency between the variable bindings
of the two inputs. For example, the “and”
node in figure 2 tests the consistency of binding
between the aircraft and target objects. Because
the location attribute values of both objects match,
a joined token is propagated.
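
The interplay of the three node types can be sketched for a single rule that joins an aircraft and a target on their location. The code below is a toy model, not a full Rete implementation: tokens are plain dictionaries, only additions to WM are handled (no flags, time tags or deletions), and the class and attribute names are assumptions chosen to echo the aircraft/target example.

```python
class TwoInputNode:
    """Joins tokens from two inputs when a shared attribute is consistent."""

    def __init__(self, join_attr):
        self.join_attr = join_attr
        self.left_memory = []   # memory node feeding the left input
        self.right_memory = []  # memory node feeding the right input
        self.joined = []        # joined tokens propagated downstream
        self.tests = 0          # consistency tests performed so far

    def _insert(self, token, own, other, token_is_left):
        own.append(token)  # state saving: remember the token for later cycles
        for partner in other:
            self.tests += 1
            if token[self.join_attr] == partner[self.join_attr]:
                pair = (token, partner) if token_is_left else (partner, token)
                self.joined.append(pair)

    def add_left(self, token):
        self._insert(token, self.left_memory, self.right_memory, True)

    def add_right(self, token):
        self._insert(token, self.right_memory, self.left_memory, False)


class TinyRete:
    """One-rule network: aircraft and target at the same location."""

    def __init__(self):
        self.join = TwoInputNode("location")

    def add_fact(self, fact):
        # Constant test nodes: route the new fact by its object type.
        if fact["object"] == "aircraft":
            self.join.add_left(fact)
        elif fact["object"] == "target":
            self.join.add_right(fact)


net = TinyRete()
net.add_fact({"object": "aircraft", "location": "a"})
net.add_fact({"object": "target", "location": "b"})
net.add_fact({"object": "target", "location": "a"})
print(len(net.join.joined), net.join.tests)
```

Each new fact is tested only against the opposite memory, so the work per cycle scales with the change to WM rather than with the size of WM.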

Parallelizing  Rete 

Extensive research by Gupta shows that the Rete
match algorithm is suitable for efficient parallel production
system implementations. The data flow nature
of the Rete algorithm makes it possible to execute
the actions of a number of Rete nodes in parallel. It is
also possible to process multiple changes to WM in
parallel. The following two general methods are typically
considered in parallelizing the Rete algorithm:
production parallelism and node parallelism [5, 19].

Production  parallelism  divides  an  application’s  rule 
base  among  a  number  of  processors.  Each  of  these 
processors  then  performs  the  match  phase  on  its  par¬ 
tition  of  the  complete  rule  base.  Because  production 
parallelism  constitutes  a  static  partitioning,  its  execu¬ 
tion  involves  very  little  communication  between  pro¬ 
cessors.  Hence  it  is  a  relatively  large  grain  problem. 
Unfortunately,  this  static  partitioning  can  lead  to  large 
variances  in  processing  time  between  processors  thus 
limiting  achievable  speed-up  [5,  46]. 

Node level parallelism attempts to execute the actions
of a number of Rete network nodes in parallel
using  different  processors.  Node  level  parallelism  re¬ 
quires  a  minimum  of  two  communications  per  node 
processing  action  making  it  a  relatively  small  grain 
problem.  Despite  the  heavy  communication  costs  as¬ 
sociated  with  node  parallelism,  it  allows  for  efficient 
and  balanced  use  of  all  processors  [5,  49]. 
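
The load-balance limitation of production parallelism can be made concrete with a small model. The sketch below is illustrative only: it assumes each rule has a fixed per-cycle match cost, assigns rules to processors round-robin (one possible static partitioning), ignores communication, and takes the slowest processor as the parallel match time. The cost figures are invented for the example.

```python
def partition_round_robin(rule_costs, n_procs):
    """Statically assign rule match costs to processors in round-robin order."""
    parts = [[] for _ in range(n_procs)]
    for i, cost in enumerate(rule_costs):
        parts[i % n_procs].append(cost)
    return parts


def match_speedup(rule_costs, n_procs):
    """Speedup of the match phase when the slowest partition sets the pace."""
    parts = partition_round_robin(rule_costs, n_procs)
    parallel_time = max(sum(p) for p in parts)
    return sum(rule_costs) / parallel_time


uniform = [1] * 8                  # every rule costs the same to match
skewed = [8, 1, 1, 1, 1, 1, 1, 1]  # one expensive rule dominates
print(match_speedup(uniform, 4))
print(round(match_speedup(skewed, 4), 2))
```

With uniform costs four processors give a speedup of 4, but a single expensive rule drops it below 2, since the other three processors sit idle while that partition finishes.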

A  significant  amount  of  research  has  been  directed 
toward  implementing  Rete  on  multiprocessors,  but  the 
prospects of implementing Rete on multicomputers
remain relatively unexplored. Multiprocessors have
been  favored,  because  simulation  shows  node  paral¬ 
lelism  is  superior  to  production  parallelism  [5,  55], 
and  the  small  granularity  of  node  parallelism  is  not 
efficiently  compatible  with  the  communications  facili¬ 
ties  of  most  message  passing  multicomputers  [7,  687]. 
Limited  theoretical  analysis  of  a  parallel  architecture 
which  implements  Rete  in  an  object-oriented  manner 
on  a  multicomputer  has  been  performed,  but  it  draws 
heavily on previous multiprocessor design concepts [7,
689].

Research  Methodology 

It is common practice for parallel computer researchers
to compare the performance of their design
with current state-of-the-art performance applied to
the same problem; however, this comparison actually
tells us very little about the relative merits of the design.
The direct comparison of designs is necessary
to show advancement of knowledge in the field, but
there are two basic reasons why this limited analysis
is generally not adequate: First, new parallel designs
are often configured to run on different machines,
where speedups attributable to different hardware cannot
be distinguished from speedups attributable solely
to new program design. Second, it is not enough that
the new parallel design outperforms the existing one,
because the performance of both designs may fall far
short of the required performance for the proposed
application.

Figure 3: Example Performance Spectrum Chart

In  order  to  determine  the  true  merit  of  a  parallel 
design  (for  a  given  implementation),  one  must  deter¬ 
mine  where  on  the  performance  spectrum  the  current 
implementation  lies  in  terms  of  processing  speed.  This 
processing  speed  is  defined  in  terms  that  are  significant 
to  the  particular  application  (e.g.,  for  expert  systems, 
the  performance  metric  is  typically  expressed  as  the 
average  number  of  rules  fired  for  every  second  of  run¬ 
time).  By  looking  at  the  performance  of  the  current 
implementation  with  respect  to  other  important  met¬ 
rics,  the  actual  merit  of  the  current  implementation 
can  be  seen  more  clearly. 

The  performance  spectrum  encompasses  five  important 
measures  of  performance  for  determining  the  relative 
merits  of  the  design  and  for  guiding  the  research  for¬ 
ward  in  a  logical  and  methodical  manner.  Figure  3 
shows  an  example  performance  spectrum  based  on  the 
five  measures  of  performance  explained  below: 

1. The state-of-the-art performance (prior to this research)
for this application when implemented on
a parallel computer architecture.

2.  The  minimum  goal  performance  value  in  terms  of 
execution  speed  required  for  this  application  to  be 
feasible.  This  measure  is  independent  of  computer 
architecture  considerations. 

3.  The  lower  bound  performance  determined  by  the 
processing  speed  of  this  application  using  the  best 
known  sequential  algorithm  implemented  on  a 
comparable  sequential  computer  architecture. 

4.  The  theoretical  upper  bound  performance  for  this 
application  based  on  the  selected  parallel  algo¬ 
rithm  and  its  implementation  on  a  specific  parallel 
computer  architecture  under  ideal  conditions. 

5.  The  actual  measured  performance  of  the  new  par¬ 
allel  implementation.  Ideally,  this  measurement 
should  be  available  for  different  numbers  of  pro¬ 
cessors. 

This  performance  spectrum  approach  results  in  two 
important  contributions:  First,  at  any  point  in  time,  if 
the  research  does  not  appear  promising,  the  researcher 
has  the  choice  of  proceeding  with  the  current  design, 
altering  the  current  design  or  falling  back  on  another 
possible  design.  Second,  the  performance  spectrum  in¬ 
dicates  whether  it  is  theoretically  possible  to  attain  the 
desired  goal  performance  using  the  chosen  combina¬ 
tion  of  algorithm  and  architecture.  Performance  spec¬ 
trum  information  also  supports  intelligent  decisions  on 
whether  to  proceed  with  refinements  of  the  current  de¬ 
sign  or  to  try  new  designs  that  eliminate  weaknesses 
of  the  current  implementation. 

Previous  Best  Performance 

The  performance  of  any  new  parallel  architecture  must 
be  compared  to  that  of  any  existing  parallel  architec¬ 
ture  that  is  considered  state-of-the-art.  The  only  par¬ 
allel  architectures  applied  to  the  RAV  expert  system 
are those developed by Donald Shakley [15]. This effort
concentrated on a Lisp implementation of the RAV
on a first generation Intel Hypercube (iPSC/1) multicomputer.
Using production parallelism and the updated
Winston and Horn inference engine, Shakley’s
iPSC/1 implementation fired an average of 1 rule every
11 seconds, with a peak performance of 0.5 rules
every second using 16 processors [16, 1352].

Although this study shows that the increased parallelism
of the iPSC/1 design could in fact produce
processing speedup compared to the TI Explorer design,
the amount of speedup realized was hampered by
the effects of interprocessor communication overhead,
load imbalance, and modelling errors. The program
design was implemented in Common Concurrent Lisp
on the iPSC/1, which is significantly different from the
proposed environment for this research. Without reimplementing
Shakley’s program on the iPSC/2, a direct
comparison with the new parallel design is not possible.
Reimplementing the code on the iPSC/2 was not
within the scope of this research; however, by scaling
the previous iPSC/1 results, a more direct comparison
is possible. Accounting for the differences in the
iPSC/1 and iPSC/2 architectures, 8 rules per second
represents an optimistic upper bound on Shakley’s
implementation on the iPSC/2.

Figure 4: iPSC/1 RAV Performance Results

Goal  Performance 

The  goal  performance  for  a  real-time  application  de¬ 
termines  how  “fast”  a  given  design  must  be  in  order 
to  meet  the  feasibility  requirements  of  the  application, 
and this figure in turn influences the design of possible
solutions. AFWAL managers were interviewed to
determine the best estimate of the real-time requirements
for the RAV system and the critical
points of the RAV mission. The estimated goal
performance  for  the  RAV  application  was  determined 
by  the  rate  of  the  incoming  data  and  the  number  of 
rules  needed  to  process  this  data  and  provide  control 
outputs  before  the  next  set  of  data  is  received.  The 
most  critical  parts  of  the  RAV  application  involve  low- 
level  route  following  and  final  approach  to  landing. 
During  these  phases  of  the  RAV  mission,  the  system 
design  must  be  capable  of  sustained  firing  of  between 
37  and  75  rules  every  second  [2]. 

Lower  Bound  Performance 

A  “good”  parallel  implementation  typically  starts  with 
a  “best”  or  at  least  “good”  serial  implementation  in 
terms  of  algorithms  and  data  structures.  Parallelizing 
a  less  than  optimum  serial  algorithm  is  typically  justi¬ 
fied  only  when  the  optimum  algorithm  is  not  amenable 
to parallelism. If a less than optimum algorithm is selected,
increased performance through parallelism is
required just to obtain the optimum serial algorithm’s
level  of  performance  [8,  122].  For  this  research,  the 
processing  speed  of  a  “good”  serial  design  is  needed 
to  delineate  a  lower  bound  on  performance  expected  in 
processing  the  RAV  expert  system.  Furthermore,  the 
serial  design  must  be  implemented  on  hardware  simi¬ 
lar  to  the  hardware  upon  which  the  subsequent  parallel 
designs  are  implemented  if  the  comparisons  are  to  be 
valid. 

The  proposed  serial  design  employs  an  existing  ex¬ 
pert  system  inference  engine  or  shell,  which  uses  the 
Rete  algorithm  when  performing  the  match  phase  of 
the  match-select-act  cycle.  The  possible  alternatives 
reviewed were: Inference Corporation’s Automated Reasoning
Tool (ART), Official Production System, version
5 (OPS5) from Carnegie-Mellon University and
the NASA developed C-Language Integrated Production
System (CLIPS). The original RAV system is implemented
using ART; however, ART is available only
in  Lisp  and  Bliss  based  versions.  Sequential  versions 
of  OPS5  have  the  same  problem,  but  Carnegie-Mellon 
has  recently  developed  more  efficient  versions  of  OPS5 
expressly  for  parallel  applications.  Unfortunately,  the 
parallel  version  of  OPS5  uses  a  combination  of  C  and 
machine  language  making  the  shell  non-transportable. 
NASA’s  CLIPS  interpreter  is  the  expert  system  shell 
chosen  for  this  research.  The  C  language  composition 
of  CLIPS  supports  transportability  and  its  syntax  is 
similar  to  ART  making  transliteration  of  application 
rule  bases  easier. 

The  lower  bound  performance  implementation  uses 
a full CLIPS interpreter executing the RAV rule base
on  the  host  processor  of  the  iPSC/2.  At  system  ini¬ 
tialization,  the  processor  is  loaded  with  the  entire  RAV 
rule  base  (326  rules)  and  a  set  of  facts  describing  the 
initial  state  (389  facts).  In  AFWAL’s  implementation 
of  the  complete  RAV  system,  several  conventionally 
programmed  subsystems  provide  continuous  simulated 
inputs  to  the  piloting  expert  system  subsystem.  Be¬ 
cause  these  systems  providing  inputs  to  the  RAV  ex¬ 
pert  system  were  not  available  under  this  design,  an¬ 
other  approach  to  providing  these  inputs  was  required. 
The “if” sides of rules involving processing of input data
were modified to obtain data from WM instead of from
outside systems. This data was provided by a set of
data facts asserted along with the initial facts. This
study’s  benchmark  test  simulates  a  limited  RAV  mis¬ 
sion;  in  this  test,  a  total  of  73  rules  are  fired  as  the 
RAV  progresses  through  the  simulated  mission.  Over 
numerous test runs, the sequential implementation averaged
3.5 seconds to fire the 73-rule benchmark test.
The resulting average of 20.9 rules per second becomes
the lower bound performance.
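
The lower bound figure and its position relative to the other points on the performance spectrum follow from simple arithmetic on the numbers quoted in the text. The sketch below restates them; the variable names are illustrative, and the "required speedup" values are a derived quantity computed here, not a result reported by the study.

```python
def rules_per_second(rules_fired, seconds):
    """Average rule-firing rate over a benchmark run."""
    return rules_fired / seconds


# Figures quoted in the text.
lower_bound = rules_per_second(73, 3.5)  # sequential CLIPS on the iPSC/2 host
previous_best_scaled = 8.0               # optimistic scaled iPSC/2 estimate
goal_low, goal_high = 37.0, 75.0         # sustained rate needed for the RAV

# Speedup over the sequential lower bound needed to reach the goal range.
required_low = goal_low / lower_bound
required_high = goal_high / lower_bound

print(round(lower_bound, 1))
print(round(required_low, 2), round(required_high, 2))
```

On these figures the sequential CLIPS baseline already exceeds the scaled previous best, and a parallel design would need a sustained speedup of roughly 1.8 to 3.6 over that baseline to reach the goal range.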

Upper  Bound  Performance 
Certain assumptions are made at the outset of this
analysis  to  model  an  ideal  environment  for  the  RAV 
expert  system  design  executing  on  the  proposed  par¬ 
allel  architecture.  First,  the  model  assumes  perfect 
load  balance  among  the  available  processors  and  that 
no  computational  overhead  is  introduced  through  par¬ 
allelism.  Second,  the  model  assumes  that  the  only 
activity  other  than  computation  on  a  processor  that 
produces  time  cost  is  inter-processor  communication. 
Under  the  HyperCLIPS  design,  the  execution  time  for 
a  single  match-select-act  cycle  is; 

1. The maximum time spent by any given processor
to update its local Rete network and select its
“best” local rule, plus

2. the time for the processors to determine which
processor has the global “best” rule through Gray-code
compare/exchange (with its neighbor processors),
plus

3.  the  time  for  the  processor  with  the  global  “best” 
rule  to  broadcast  the  actions  associated  with  the 
best  rule  to  all  processors. 

Given the time components related to a single
match-select-act cycle, it is possible to calculate upper
bound performances for the complete RAV benchmark
test. Under the assumptions of the model (even load
balance and no parallelization overhead), the total
execution time should be evenly divided by the number
of processors. This parallel computation model represents
a linear decrease in processing time with respect
to the number of processors used. The easiest method
for computing the total minimum communication time
involves finding the minimum communication time per
cycle and then multiplying by the number of cycles. By
adding this minimum total communication cost to the
linear speedup figure representing the total computation
time, the total minimum execution time can be
determined. Using N processors running a restricted
rule base with C cycles, the upper bound performance
of the proposed design, in seconds, can be expressed as:

\[ t_{PAR} = \frac{t_{SEQ}}{N} + C \, (t_{SE} + t_{AB}) \]

where $t_{PAR}$ is the parallel computation time, $t_{SEQ}$
is the sequential computation time, $t_{SE}$ is the
select-exchange time and $t_{AB}$ is the act-broadcast time.

The  task  of  determining  the  actual  upper  bound  per¬ 
formance  now  becomes  that  of  acquiring  actual  times 
for  the  variables  in  the  previous  upper  bound  equation. 
This  process  can  be  significantly  simplified  if  we  con¬ 
sider  the  computation  and  communication  costs  sep¬ 
arately.  Finding  the  minimum  computation  time  is  a 
relatively straightforward task. From the lower bound
performance measurements, we know that the serial
implementation  of  CLIPS  requires  3.5  seconds  to  exe¬ 
cute  the  73  cycle  restricted  rule  base.  The  minimum 
parallel  computation  time  now  becomes  the  serial  pro¬ 
cessing  time  divided  by  the  number  of  processors  used. 

In  comparison  with  the  computation  time,  deter¬ 
mining  the  minimum  communication  time  is  a  some¬ 
what  more  involved  task,  but  provides  a  more  accurate 
indication of the actual cost. A total of d + 1
communications between processors is required to determine
which  processor  has  the  best  candidate  rule,  where  d  is 
the  dimension  of  the  cube  being  used.  Only  one  broad¬ 
cast  communication  is  required  for  the  processor  with 
the  “best”  global  rule  to  send  the  actions  associated 
with  that  rule  to  all  other  processors. 

Using  a  simple  message  passing  ring  program,  it  was 
determined  that  a  compare  and  exchange  communica¬ 
tion  requires  0.00424  seconds  and  the  average  action 
message  broadcast  requires  0.00776  seconds  [9,  6-5]. 
The  iPSC/2’s  2.8  Mbytes  per  second  interconnection 
network  is  capable  of  transferring  each  compare  and 
exchange  communication  in  0.0000825  seconds  and  the 
broadcast message in 0.0001678 seconds. These times
are roughly fifty times smaller (nearly two orders of
magnitude) than the message transmission times actually
observed. It becomes obvious that overheads associated
with preparing the data
for  transmission,  setting  up  the  message  transmission 
path  and  converting  the  message  to  data  at  the  re¬ 
ceiving  node  account  for  most  of  the  actual  message 
passing  time. 
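
The gap between wire time and observed time can be checked directly from the four figures quoted above. The following is a small illustrative sketch in Python; the dictionary and function names are hypothetical, and only the four timing values come from the text.

```python
# Timings quoted in the text, in seconds.
OBSERVED = {"compare/exchange": 0.00424, "broadcast": 0.00776}
WIRE = {"compare/exchange": 0.0000825, "broadcast": 0.0001678}


def overhead_report(observed, wire):
    """Return {message: (observed/wire ratio, fraction of time that is overhead)}."""
    report = {}
    for name, t_obs in observed.items():
        t_wire = wire[name]
        report[name] = (t_obs / t_wire, (t_obs - t_wire) / t_obs)
    return report


if __name__ == "__main__":
    for name, (ratio, share) in overhead_report(OBSERVED, WIRE).items():
        print(f"{name}: observed/wire = {ratio:.0f}x, overhead = {share:.1%}")
```

Under these figures, software overhead accounts for nearly all of each message's cost, which is the point the paragraph makes.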

Given  the  lower  bound  processing  time  and  commu¬ 
nication  time  figures,  the  upper  bound  performance 
(in seconds) can be expressed as:

\[ \frac{3.5}{N} + \bigl( (d + 1) \times 0.00424 + 0.00776 \bigr) \times 73 \]

where N equals the number of processors and $N = 2^d$.
Figure 5 shows the results of this analysis applied to
different numbers of processors.
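
As a rough aid to reading the expression, the sketch below evaluates it in Python for a few cube dimensions. The constants are the values quoted in the text; the function and constant names are hypothetical, and the printed values are simply what the expression yields, not readings from the paper's figure.

```python
T_SEQ = 3.5            # serial time for the benchmark, seconds (from the text)
CYCLES = 73            # match-select-act cycles in the benchmark (from the text)
T_EXCHANGE = 0.00424   # one compare/exchange message, seconds (from the text)
T_BROADCAST = 0.00776  # one action broadcast, seconds (from the text)


def upper_bound_seconds(dimension):
    """Best-case benchmark time on a cube of the given dimension (N = 2**d)."""
    if dimension < 0:
        raise ValueError("dimension must be non-negative")
    n_procs = 2 ** dimension
    # d + 1 compare/exchange messages plus one broadcast per cycle.
    per_cycle_comm = (dimension + 1) * T_EXCHANGE + T_BROADCAST
    return T_SEQ / n_procs + per_cycle_comm * CYCLES


if __name__ == "__main__":
    for d in range(1, 6):
        print(f"d={d}  N={2 ** d:2d}  t={upper_bound_seconds(d):.3f} s")
```

The computation term shrinks as N grows while the communication term grows with d, so the expression stops improving beyond some cube size.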

HyperCLIPS  Design  and  Performance 
The  HyperCLIPS  system  design  implements  a  paral¬ 
lel  production  system  interpreter  in  CLIPS  using  the 
production  parallelism  concept.  This  design  supports 
the  use  of  parallelism  in  all  three  phases  of  the  match- 
select-act  cycle.  A  general  high  level  description  of  the 
HyperCLIPS algorithm follows:

1. While termination is not detected, continue

2. Parallel Match

• each processor receives WM changes from a
root processor

• each processor updates its local Rete network
based on WM changes

3. Parallel Local Select

• each processor selects the “best” rule from
its local conflict set

4. Global Select

• processors perform compare/exchange of local
“best” rule priority with neighboring processors

• root processor holds the “best” global rule
when compare/exchange is complete

5. Broadcast Global Act

• root processor broadcasts WM change specified
by the global “best” rule’s “then” side.

6. Return to step 1


Figure 5: Upper Bound Performance Metric


Figure 6: Experimental Performance (RAV)

Under the concept of the HyperCLIPS design, each
active processor supports a full production system
interpreter; each processor executes this CLIPS shell
program on a subset of the total rule base. At system
initialization, the iPSC/2 host processor loads all
working processors with an approximately equal subset
of the rule base. This research makes no attempt
to allocate rules in an optimum manner with respect
to load balance among processors. Instead, an ad-hoc
allocation is used to distribute the rules evenly
among processors. With this static decomposition
approach, no interprocessor communication is required
during the match phase. Each processor’s local match
phase requires no communication because all information
needed to update the Rete network is already local
to that processor.

The  most  significant  modification  to  the  serial 
CLIPS  code  involves  adding  the  message  passing  capa¬ 
bilities  between  processors.  For  the  compare/exchange 
communication,  each  processor  exchanges  messages 
with  all  its  nearest  neighbor  processors  in  the  binary 
N-cube  network.  This  message  contains  only  the  pri¬ 
ority  of  a  given  processor’s  “best”  rule.  The  processor 
with  the  highest  priority  rule  becomes  the  root  pro¬ 
cessor  and  broadcasts  the  action  associated  with  its 
chosen  rule  to  all  other  processors.  This  communica¬ 
tion  is  simply  an  ASCII  string  that  is  processed  by  the 
CLIPS  interpreter  on  each  processor. 
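
The global select step can be pictured as a reduction over the dimensions of the cube. The sketch below simulates it in Python for N = 2^d nodes: in round k, every node compares its current best (priority, node) pair with that of the neighbor whose identifier differs in bit k, and keeps the better one. This is a standard dimension-by-dimension exchange chosen for illustration; the function name, the tie-break on the lower node identifier, and the use of exactly d rounds are assumptions, not details taken from the HyperCLIPS code (the text counts d + 1 communications).

```python
def global_select(priorities):
    """Simulate compare/exchange of local best-rule priorities on a hypercube.

    priorities[i] is the priority of node i's best local rule. Returns a list
    whose entry i is the (priority, owner) pair that node i holds at the end.
    """
    n = len(priorities)
    if n == 0 or n & (n - 1):
        raise ValueError("number of nodes must be a power of two")
    dimension = n.bit_length() - 1
    best = [(p, node) for node, p in enumerate(priorities)]
    for k in range(dimension):
        exchanged = []
        for node in range(n):
            neighbor = node ^ (1 << k)  # partner across dimension k
            # Higher priority wins; ties go to the lower node identifier.
            exchanged.append(max(best[node], best[neighbor],
                                 key=lambda pair: (pair[0], -pair[1])))
        best = exchanged
    return best


if __name__ == "__main__":
    result = global_select([3, 9, 1, 9])
    print(result)  # every node ends with the same (priority, owner) pair
```

In this simulation every node learns the winner, so the owner of the winning rule knows it should act as the root for the broadcast.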

The upper bound prediction, shown in figure 5, indicates
that almost no speed-up can be expected from
this design and indeed, the results are somewhat
discouraging. Figure 6 shows that, as the number of
processors increases, the measured speedup actually
decreases. On the surface, it appears that communication
overhead is the culprit, but further analysis
shows it to be only a contributing factor. The
communication costs shown in figure 5 represent an
accurate measurement of the average communication
times, whereas the computation times are optimistic.
This knowledge allows us to make a meaningful assessment
of the load balance among processors achieved in
this research. Comparing the upper bound results with
those in figure 6, it is obvious that the load balance
among processors is far from optimum. This problem
raises the question of whether an efficient partitioning
of the RAV expert system is possible.

Performance  Comparison  Findings 
This  research  indicates  that  the  basic  methods  behind 
this  design  are  promising,  but  the  design  and  imple¬ 
mentation  suffer  due  to  influences  of  the  parallel  archi¬ 
tecture  chosen  and  the  characteristics  of  the  applica¬ 
tion  itself.  The  lower  bound  performance  experienced 
by  the  serial  CLIPS  design  is  impressive  (especially 
considering  it  is  implemented  on  a  micro-computer), 
but  it  still  falls  short  of  the  minimum  real-time  require¬ 
ments.  CLIPS  performs  particularly  well  because  it 
takes  advantage  of  the  Rete  state-saving  algorithm  for 
the  match  phase  of  the  match-select-act  cycle.  This  is 
largely  the  reason  that  the  serial  version  outperforms 
Shakley’s  parallel  implementation  [15,  62]. 

The  results  show  that  the  HyperCLIPS  implemen¬ 
tation  on  the  iPSC/2  will  not  produce  effective  results 
for  the  RAV  application.  The  upper  bound  on  perfor¬ 
mance  indicates  that  almost  no  speed-up  is  achievable 
using  the  HyperCLIPS  design  due  primarily  to  large 
communication  overheads.  Even  when  using  minimum 
communication  between  processors,  the  overhead  in¬ 
curred  constrains  speedup  to  1.13  times.  In  the  case 
of  the  RAV  application,  it  is  both  load  imbalance  and 
communication  overhead  that  conspire  to  produce  neg¬ 
ative  speedups.  The  ad-hoc  partitioning  method  used 
for  the  RAV  application  fails  to  effectively  balance  the 
processing  load.  The  results  indicate  that  using  mul¬ 
tiple  processors  and  ad-hoc  rule  partitioning  does  not 
reduce  the  serial  processing  time.  These  results  war¬ 
rant  an  investigation  of  the  RAV  expert  system  char¬ 
acteristics  to  determine  why  the  use  of  multiple  pro¬ 
cessors  does  not  appear  to  decrease  execution  time. 
Additionally, the use of some algorithmic mechanism
to partition rules among processors for effective load
balance must be investigated.

The  Partitioning  Problem 

The characteristics of expert systems applications
become significant when considering the optimum
manner for parallelizing them. Empirical measurements
by Gupta and Forgy uncover two of the more
vital characteristics of production systems [6, 93]:

• The affect-set (the set of rules affected by a given
rule firing) is generally very small with respect to
the total number of rules in the application.

• The size of the affect-set does not increase as the
number of rules in the application increases;
instead it remains approximately constant (between
25 and 40 for the systems measured).


Figure  7:  Actual  Performance  Spectrum  (RAV) 


Gupta  discusses  the  affect-set  problem  and  gives  sev¬ 
eral  plausible  reasons  why  this  problem  exists  in  pro¬ 
duction  systems.  If  we  subscribe  to  Gupta’s  assertions 
that  small  affect-sets  are  basically  an  inherent  part  of 
production  systems,  then  production  parallelism  be¬ 
comes  significantly  less  attractive  as  the  following  sec¬ 
tion  discusses. 

Production  parallelism  relies  heavily  on  the  exis¬ 
tence  of  fortuitous  rule-to-processor  allocations  in  a 
given  application.  In  order  for  production  parallelism 
to  produce  significant  speedups,  the  computational 
load  associated  with  the  match  phase  must  be  as 
evenly  distributed  as  possible.  Considering  the  Rete 
algorithm,  this  means  that  the  computational  load  of 
updating  the  local  Rete  network  on  each  given  pro¬ 
cessor  must  be  as  equal  as  possible.  Oflazer  reminds 
us  that  this  problem  must  be  considered  for  the  case 
of  firing  every  rule  in  the  rule  base,  particularly  those 
rules that fire more often than others [13, 96]. Oflazer
proves that this problem is NP-Complete, but also
describes an efficient heuristic approach for rule
partitioning [13, 94]. More recently, Dixit and Moldovan
describe a method for allocating rules to processors
that is independent of Rete-based implementations.
Their motivation is based on finding more generalized
ways of expressing and applying parallelism in production
systems that are independent of algorithm-specific
details [1, 24].

Oflazer’s partitioning algorithm results in rule-to-processor
allocations that produce 1.15 to 1.25 times more
speedup than ad-hoc allocation methods [5, 155].
There are several reasons that this partitioning does
not produce better results [5, 111]:

1. The size of the affect-set constrains the number
of processors that can produce effective work. If
the number of processors is larger than the size
of the affect-set, then there will be some processors
that do not contain rules affected by the
firing of a given rule. Whenever this situation occurs,
the processors without affected rules will be
essentially idle.

2.  The  time  to  process  different  rules  in  the  affect-set 
can  vary  greatly  depending  on  the  specific  appli¬ 
cation.  If  each  processor  has  one  affected  rule, 
then  the  actual  time  each  processor  takes  to  up¬ 
date  that  rule  will  produce  some  computational 
load  imbalance. 

3.  Loss  in  Rete  network  sharing  will  tend  to  increase 
the  number  of  redundant  computations.  Rules  in 
the  affect-set  are  likely  to  contain  at  least  some 
common  condition  elements.  If  these  rules  are 
allocated  to  different  processors,  then  each  pro¬ 
cessor  will  perform  redundant  computations  while 
updating  its  local  Rete  network. 

Top-level examination of the RAV expert system indicates
that it is not particularly amenable to production
parallelism methods. Analysis indicates that each
rule firing affects an average of just four rules, with
an observed range of between 1 and 18. Four rules
represent only 1.5% of the 273 rules in the RAV rule
base. These results indicate that the typical affect-set
for the RAV expert system is significantly smaller than
those of the applications measured by Gupta and Forgy.
Variations in rule processing time within the affect-set and
loss in Rete network sharing among processors have
not been examined. The existence of small affect-set
sizes is sufficient to explore the assertion that the RAV
application is not amenable to production parallelism.
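
The effect of a small affect-set on processor utilization can be illustrated with a short Python sketch. The rule count (273) and affect-set size (four) come from the text; the round-robin allocation, the rule names, and the choice of 16 processors are hypothetical stand-ins, since the ad-hoc allocation actually used is not described in detail.

```python
def allocate_round_robin(rule_names, n_procs):
    """Hypothetical ad-hoc allocation: deal rules out to processors in turn."""
    return {rule: i % n_procs for i, rule in enumerate(rule_names)}


def busy_processors(allocation, affect_set):
    """Processors that hold at least one rule touched by the current firing."""
    return {allocation[rule] for rule in affect_set}


if __name__ == "__main__":
    rules = [f"rule{i}" for i in range(273)]      # rule count quoted in the text
    alloc = allocate_round_robin(rules, n_procs=16)
    affected = ["rule10", "rule11", "rule12", "rule13"]  # an affect-set of four
    busy = busy_processors(alloc, affected)
    print(f"{len(busy)} of 16 processors have work; {16 - len(busy)} are idle")
```

However the rules are dealt out, a firing that touches four rules can keep at most four processors busy during the match phase, and fewer if several affected rules land on the same processor.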

This  discussion  indicates  that  production  paral¬ 
lelism  is  significantly  constrained  by  the  character¬ 
istics  of  the  particular  application,  particularly  the 
size  of  the  affect-set  [5,  113].  The  average  affect-set 
size  is  a  reasonable  determinant  of  the  upper-bound 
speedup  that  can  be  expected  using  production  par¬ 
allelism  if  processing  differentials  and  losses  in  Rete 
network  sharing  are  not  considered.  The  size  of  the 
average  RAV  expert  system  affect-set  (four),  now  be¬ 
comes  the  upper  bound  on  the  number  of  processors 
that  can  be  effectively  used  to  increase  execution  speed 
and the upper bound on speedup as well. Other factors
not considered in this analysis may influence the
upper bound of four somewhat, but they are not likely
to influence it considerably. Other factors aside, it now
becomes obvious that production parallelism is only a
marginally promising means for increasing the execution
speed of the RAV expert system application in its
current form due to the structure of its rule base.

Conclusions 

Parallel  processing  is  a  promising  approach  to 
achieving  real-time  processing  of  expert  systems  soft¬ 
ware,  but  a  number  of  impressive  problems  exist  be¬ 
tween  this  concept  and  its  implementation.  The  pri¬ 
mary  problems  that  need  to  be  overcome  are  inter- 
processor  communications  overhead  and  load  balanc¬ 
ing.  The  major  factor  in  minimizing  both  of  these 
problems  involves  the  proper  choice  of  problem  de¬ 
composition  and  the  parallel  computer  architecture. 
Although  the  Rete  match  algorithm  produces  impres¬ 
sive  serial  performance  in  processing  expert  systems,  it 
also  results  in  a  problem  granularity  that  is  marginally 
compatible  with  the  architecture  chosen  for  this  re¬ 
search.  The  results  indicate  that  this  type  of  problem 
requires  the  use  of  parallel  computer  architectures  with 
significantly  less  communication  overhead. 

This research goes beyond just producing a new parallel
architecture application. Although the performance
of the HyperCLIPS design on the iPSC/2
hypercube was less than impressive, its performance is
quantified in terms of the lower and upper performance
bounds and the ultimate goal performance. This approach
not only adds validity to the design, but also exposes
the level of maturity that RAV expert system
research has reached. Based on this
approach, the findings indicate that research into parallel
processing of the RAV expert system is still in
its infancy. The processing speeds obtained using
serial CLIPS support the continued use of the
Rete match algorithm. However, the characteristics of the
RAV expert system suggest that the system lends itself
to very limited speedup using production parallelism.
Therefore, RAV expert system research may be better
served by approaching parallelism from the standpoint
of node parallelism rather than production parallelism.

The  problems  encountered  in  this  research  under¬ 
score  the  need  for  significantly  more  background  re¬ 
search  in  the  area  of  real-time  expert  systems.  Most 
previous  research  in  applying  parallelism  to  production 
systems  concentrates  on  “stationary”  types  of  prob¬ 
lems  instead  of  real-time  types  of  problems.  Process¬ 
ing  speed  is  important  in  a  number  of  computation¬ 
ally  intensive  “stationary”  applications,  but  the  need 
for  additional  processing  speed  in  real-time  systems  is 
critical. Laffey notes that Gupta and Forgy’s assertions
about the level of parallelism in the average production
system do not typically apply to real-time
production systems. Real-time applications typically
involve changes to WM each cycle as a direct result of
incoming data. This characteristic of real-time applications
has the potential to increase achievable speedups
significantly [10, 40].

In their article, Fast is not Real-Time, O’Reilly
and Cromarty state that guaranteed response time is
just as important as fast processing capability in achieving
real-time behavior. Without guaranteed response
times, the system may not be capable of responding
quickly enough in critical situations even though its
average processing time is very fast [14, 249]. The
implication is that average processing times mean very
little in terms of real-time expert system performance.
Instead, the critical measure becomes the maximum
time required to process updates to WM at any
given time. This problem is likely to be compounded
by the asynchronous flow of data into the WM. Laffey
even cites research by Halley indicating that Rete
is not appropriate for real-time applications because
an upper bound on the Rete network update times
cannot be accurately predicted [10, 40].

What  level  of  parallelism  does  exist  within  real-time 
expert  systems  applications?  Can  this  parallelism  be 
successfully  extracted  allowing  significant  increases  in 
speed-up  using  parallel  processing  methods?  Is  it  pos¬ 
sible  to  accurately  predict  the  guaranteed  response 
time in these systems? Future research will concentrate
on determining the feasibility of applying different
levels of parallelism to solving the problems inherent
in real-time expert system applications.


Abstract.

For  the  past  20  years,  an  increasing  interest  has  been  devoted 
to  the  sequential  Conjugate  Gradient  Method  for  solving  large 
linear  systems  arising  from  the  modeling  of  physical  problems 
(especially  for  very  large  systems  with  sparse  matrices).  This 
paper  deals  with  the  implementation  on  parallel 
supercomputers  of  a  preconditioned  conjugate  gradient  method 
for  solving  the  corrective  switching  problem  obtained  while 
modeling  the  behavior  of  power  systems  in  electrical 
networks.  This  problem  consists  in  finding  the  successive 
solutions  of  many  close  linear  systems  (not  too  large)  with 
very  ill-conditioned  matrices  (sometimes  even  singular).  We 
present  a  new  method  based  on  the  Preconditioned  Conjugate 
Gradient  algorithm  with  an  original  preconditioning  and  study 
its  parallelization  on  both  shared  and  distributed  memory 
computers. 


1.  Setting  of  the  problem 

During the control of electrical networks, the operator must
ensure that the system is in a safe state (i.e. that it can be
protected against incidents liable to occur in real
time). Demand and plant capacity are such
that nuclear energy between two plants flows from various
nodes of the network. The loss of one element could jeopardize
the security of the whole system through chain tripping; in such
a case, a line becomes overloaded and, without any intervention, the
protective devices will act and the line will trip out. In actual
operating conditions, the switching actions that the operator
applies to the electrical network ensure that overloads
disappear before the delayed protective devices go into action.
Such actions are shown on the picture at the end of the paper.
The computation of switching actions is a combinatorial
problem that is very hard to solve. The connections of the switching
elements are described by discrete variables. The corrective
switching problem amounts to determining the various
possible solutions of the load flow calculation. Each such
situation requires solving a linear system, and the matrices of
these systems differ from each other in only a few elements.

Let us consider the N consecutive linear systems below:

\[ (S_i) \quad A_i x_i = b_i, \qquad 1 \le i \le N \]


where the matrices $A_i$ (of size n by n) are "close" to each
other, viz. $A_{i+1} = A_i + \Delta_i$, with $\Delta_i$ of small norm. The
solutions $x_i$ will be close to each other in this sense, and we
want to take full advantage of this.

Note that this problem also occurs in Adaptive Filtering or
Finite Element modeling.


2.  Solving  the  corrective  switching  problem 

The method commonly used for solving this problem consists
of refactorizing the matrix of each system $(S_i)$ by the direct
Cholesky method and solving each system separately. Note that
this method is not practical when n becomes too large, because
of both the large storage required and high rounding errors.
Practically speaking, n is about 100 for a typical corrective
switching problem. In a parallel implementation, however, the
successive solutions can be obtained simultaneously on
multiple processors without any communication, and the
local computations require a load-balanced amount of
operations, namely $O(n^3)$. Moreover, the solution is
difficult to carry out because the matrices are very
ill-conditioned (they can even be singular, which means in practice
that we obtain several solutions).


3.  A  better  algorithm  based  on  the 
Conjugate  Gradient 

A  new  iterative  method  based  on  the  Preconditioned  Conjugate 
Gradient  Method  has  been  proposed  by  the  authors  for 
sequential  computers.  It  takes  into  account  the  small  norm 
variations  of  the  matrices.  The  basic  idea  is  to  factorize  a 
matrix  Ai  (by  Cholesky)  for  a  given  index  i  and  to  use  it  as  a 
preconditioning  of  the  Aj  (for  j>i)  until  this  factorization 
differs  too  much  from  Aj.  We  can  typically  solve  about  50 
linear  systems  with  the  same  preconditioning. 
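
The idea of reusing one Cholesky factor across several nearby systems can be sketched in a few lines of dense, pure Python. This is only a minimal illustration under stated assumptions: all function names are hypothetical, the matrices are tiny dense lists of lists, and the stopping rule is a plain residual-norm test. It is not the authors' implementation.

```python
import math


def cholesky(a):
    """Lower-triangular factor l with a = l * l^T (a symmetric positive definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = math.sqrt(s) if i == j else s / l[j][j]
    return l


def solve_with_factor(l, r):
    """Solve (l * l^T) z = r by forward then backward substitution."""
    n = len(l)
    y = [0.0] * n
    for i in range(n):
        y[i] = (r[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (y[i] - sum(l[k][i] * z[k] for k in range(i + 1, n))) / l[i][i]
    return z


def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def pcg(a, b, l, x0, tol=1e-10, max_iter=None):
    """Preconditioned CG on a x = b, preconditioned by the factor l of a nearby matrix."""
    max_iter = max_iter or 2 * len(b)
    x = list(x0)
    r = [bi - ai for bi, ai in zip(b, matvec(a, x))]
    z = solve_with_factor(l, r)
    p = list(z)
    rz = dot(r, z)
    for it in range(max_iter):
        if math.sqrt(dot(r, r)) <= tol:
            return x, it
        ap = matvec(a, p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = solve_with_factor(l, r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


if __name__ == "__main__":
    a1 = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
    a2 = [[4.1, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.05]]  # small change to a1
    b = [1.0, 2.0, 3.0]
    factor = cholesky(a1)                            # factorize once
    x1, it1 = pcg(a1, b, factor, [0.0] * 3)
    x2, it2 = pcg(a2, b, factor, x1)                 # reuse factor and previous solution
    print(it1, it2)
```

Starting each solve from the previous solution reflects the remark that successive solutions are close to each other; a real implementation would also need a rule for deciding when the old factor has drifted too far and must be recomputed.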

A complexity study has proved the superiority of this method
over the usual methods, since each step costs less than
a Cholesky factorization and the preconditioning is not
computed at each step, but only at the next reinitialization.

The figure below gives the numerical results on a standard
sequential computer (SUN3).


Figure 1.

Successive  Preconditioned  Conjugate  Gradient  method. 


It is well known that the Preconditioned Conjugate Gradient
algorithm gives the solution of a system of size n in at most
n iterations (in exact arithmetic). In terms of the Basic Linear
Algebra Subprograms, each iteration requires BLAS1
(vector-level) operations, namely two
DOTs (inner products) and three SAXPYs (vector updates),
plus the solution of a system (usually not too expensive to
solve) and a matrix-vector product (both needing $O(n^2)$
operations), which are BLAS2 operations and can also be
decomposed into BLAS1 elementary vector operations, and the
evaluation of two scalar parameters. Because of the regularity
of both the data and the operations, the granularity of the
tasks is taken as $O(n)$, and the tasks are suitable for
implementation on vector processing units.

This  analysis  leads  to  two  basic  ways  to  implement  the 
successive  Conjugate  Gradient  in  parallel;  first,  we  can  run 
each  Conjugate  Gradient  algorithm  locally  to  the  processors 
(this  first  solution  will  be  limited  by  the  local  memory  size), 
or  we  can  parallelize  successively  the  most  expensive  task  of 
one  Conjugate  Gradient  (the  BLAS2  operations)  to  run  it 
concurrently  on  all  the  processors. 


4. Parallel implementation on shared-memory
computers

The parallelization of numerical algorithms on this kind of
parallel computer has been much studied. It is quite simple if
we take into account the analysis of the precedence constraints.
The schedule of the tasks is synchronized even if the numbers
of iterations are different on the various processors. The large
shared memory allows every processor to access easily all the
global information (such as the Cholesky
factors used as preconditioners for several systems). The use
of local cache memories speeds up execution after the
duplication of the common data.


The strategy consists in dispatching the various systems to the
processors. Note that shared-memory vector computers are
limited by their small number of processors (no more than 10
in practice).


5.  Parallel  implementation  on  distributed- 
memory  computers 

Finding an efficient implementation of this method is difficult
because successive systems are solved with the same
preconditioning and we need a global check to ensure that
some processors are not locally computing too many
iterations. Thus, the amount of computation is not well
balanced from one processor to another. In order to simplify the
computation, an initial phase is run in which the number of
linear systems to be solved with the same preconditioning is
determined. This initialization leads to a static task
allocation.

We propose here an implementation where a processor
first computes the Cholesky factor of $A_i$ and then broadcasts it
to the others. Each processor computes the solution of a
system with a given amount of computation (that is, a
given number of iterations, which increases with the index i)
inversely proportional to its distance from the sender
processor.

Then, we asynchronously compute and broadcast a new
Cholesky factor to the processors. Thus, the first processor to
receive the data is the one which has to do the most iterations.
The aim is then to find the "best" allocation of the
successive linear systems to the processors, namely the one
which minimizes the total execution time.

The  following  figure  gives  the  numerical  results  for  the  same 
example  as  before  for  a  Cholesky  factorization  plus  the 
resolution  by  successive  Conjugate  Gradient  method  until  the 
next  reinitialization. 

The experiments have been performed on a 32-processor hypercube
vector computer (FPS T40). The various results represent the
various levels of programming of each processor.


Figure  2. 

Schedule of work on the distributed-memory
vector-computer

Another way to implement the corrective switching problem is
to distribute among the various processors all the local
information about the physical problem. This approach
completely changes the problem to solve: instead of solving a
linear system, each processor exchanges local
information with neighboring processors and performs local
elementary operations such as summing the electric power
flowing from a node in all directions.


6.  Conclusion 

We conclude with numerical comparisons between shared-memory
and distributed-memory computers (of the "same
order of magnitude" of performance). Numerical experiments
have been run on a shared-memory parallel computer
(Alliant FX80 with 8 vector processors, each with 16 MFLOPS
of peak performance) and a distributed-memory parallel
computer (FPS T40 hypercube with 32 vector processors, each
with 12 MFLOPS of peak performance).

The experiments show the good behavior of the successive
preconditioned conjugate gradient method for solving the
practical corrective switching problem on parallel computers.
The distributed-memory implementation is better because of
the larger number of processors, and worse than the
shared-memory implementation for the same number of
processors, as was expected.
Example of a practical corrective switching problem.

We give below the picture of a fragment of the Electrical High
Voltage (EVH) French system in a very strained situation. The
demand and the possibilities of the plants are such that nuclear
energy from the Bugey and Cruas plants flows from Ssv.OS71 to
Vielm S71. The loss of one of the Crey-Gen elements could
jeopardize the security of the whole system through chain
tripping: in such a case, an overload of the Alber-Gen line
occurs and, without any operation, the protective devices will
act and the line will trip out.
Abstract

A new parallelization technique for Fault Simulation is
described that is suited for message passing based parallel
processors. The problem is parallelized by first casting it
in Dataflow form and then constructing a Dataflow emulator
for message passing systems. A fault simulator for
combinational logic has been implemented on a Transputer
based parallel processor, the IBM VICTOR multiprocessor.
Overall performance has been measured for several logic
designs.


1.  Introduction 

Fault simulation [1] means the simulation of a logic design
that has been modified to reflect the presence of a fault.
Simulating such faulty designs is done, among other things,
to assess the ability of a proposed set of test patterns to
expose faults in the real design. Typical faults that are
simulated are any input pin or output pin of any gate stuck
at 1 or stuck at 0. In principle, each such fault gives rise
to a modified design that has to be simulated. Many such
faults are equivalent, however, in the sense that the
corresponding modified designs behave identically. Typically
there are on the order of 3 or 4 non-equivalent faults per
gate in a logic design. Simulating all these modified designs
is therefore very costly.

In developing our parallel fault simulation algorithm, we
considered various attributes that the fault simulator should
have. Not all attributes have been implemented yet, but we
feel confident that they can be, with the approach that we
have used.

How these attributes are implemented depends on the hardware
characteristics of the parallel processor.

The parallel processors that we will consider in this paper
are distributed memory machines. Such a machine consists of a
number of nodes, also called processors. Each node has a CPU
and some local memory, typically on the order of several
megabytes. The nodes communicate by sending messages to each
other.

First, we wanted the simulator to be flexible. By this we
mean that it should not restrict too much the range of logic
designs that can be simulated. For example, the simulator
should be able to handle easily very large designs and
designs with embedded memory elements like latches and
feedbacks. Parallel pattern techniques [2-5] do not have this
attribute, because they do not handle memory elements or
feedbacks efficiently. The inability to handle feedbacks well
also excludes pipeline techniques [6, 7].

Secondly, the limit on the parallelizability of a given
problem should be determined by the properties of that
problem and not by hardware or software. For example,
parallelization according to the single-controller/many-slaves
model does not have this characteristic, because for
sufficiently many processors the central controller becomes
the bottleneck. The maximum speedup is then determined by how
fast the central controller can work and not by the degree of
parallelizability of the problem.

A  practical  measure  of  how  well  a  problem  has 
been  parallelized  is  the  number  of  processors  Pm  at 
which  the  speedup  curve  flattens  out.  A  good 
parallelization  is  one  where  Pm  depends  only  on  some 
overall  characteristic  of  the  problem,  like  its  size  for 
example. 

Thirdly, when the number of processors increases, the total
available memory increases with it and the size of the
largest problem that can be handled should increase as well.
It does not always work out that way, however, because of the
limited local memory that is available to each processor: the
partitioning that was used may lead to an overflow of that
local memory, and if no provisions are made to use memory
available on other processors, even small problems may not be
handled.

Good memory usage can be described as follows. Assume that
the uni-processor on which we run the uni-processor version
of our algorithm has infinite memory and let the memory
required by this sequential version for a given problem be M.
Assume there are P processors, each with local memory of
size m. Finally, let the total memory, summed over all P
processors, that is needed by the parallel algorithm for the
same problem be Mp. Mp should be almost independent of P.
Complete independence is not possible, because for example
some information about the network connectivity has to be
stored somewhere, and this information does increase with P.
Memory is then used properly when Mp is bounded by cM, where
c is a number that depends only weakly on P and is close to 1
for small P, and when any problem that can run on a
uni-processor with memory P·m/c (the total memory of the P
nodes divided by c) is also guaranteed to run on the parallel
processor.

The classical way of parallelizing fault simulation is by
partitioning the fault-list among the available processors
[8-10]. This parallelization is straightforward because
different faults can be handled independently. It is very
suitable for shared memory machines where the partitioning of
the fault list can be done dynamically [8]. It has the
disadvantage on distributed memory machines that the complete
design has to be replicated on all processors. Better
partitioning algorithms have been designed [11] in which each
processor only needs the description of a portion of the
design, but even then the total amount of memory required for
the design description is considerably more than on a
uni-processor.

Finally, all processors should be used as much as possible
within the limits set by the inherent parallelizability of
the problem itself. Roughly speaking, this means that most of
the time most of the processors should be doing useful work.
If the number of processors is P and the total lapse time
is t, then the total time taken by the problem is Pt. The
total CPU time spent on the problem is called Tcpu and is
obtained by adding up all the time periods on all processors
when a processor is working. The (implementation of the)
algorithm is called load-balanced when Pt is not much larger
than Tcpu. It is an efficient parallelization when Tcpu is
not much larger than the CPU time needed by the corresponding
sequential version of the algorithm on a single processor.
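
The two criteria above can be made concrete with a small Python
sketch. It is an illustration only: the function names and the
sample timings are hypothetical and are not measurements from
the paper.

```python
def utilization(num_processors, lapse_time, busy_times):
    """Return (P*t, Tcpu, Tcpu / (P*t)) for one run.

    busy_times holds, per processor, the time it spent working.
    A ratio close to 1 means the run is load-balanced.
    """
    total_time = num_processors * lapse_time   # P * t
    t_cpu = sum(busy_times)                    # Tcpu
    return total_time, t_cpu, t_cpu / total_time


def efficiency(t_cpu, t_sequential):
    """Ratio of sequential CPU time to parallel CPU time (close to 1 is efficient)."""
    return t_sequential / t_cpu


# Hypothetical run: 4 processors, 10 s lapse time.
TOTAL, T_CPU, BALANCE = utilization(4, 10.0, [9.0, 8.0, 9.5, 7.5])
EFFICIENCY = efficiency(T_CPU, 30.0)
```

Here Pt is 40 s and Tcpu is 34 s, so 85% of the available
processor time is spent working; with a hypothetical
sequential CPU time of 30 s the parallel version does about
13% more work than the sequential one.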

These requirements lead to some important conclusions. First,
when we partition the problem among the available processors,
the partitioning should be done such that there is no
duplication of design descriptions and no duplication of
calculations, to avoid underutilizing memory and/or
processing power. This requirement therefore excludes
partitioning the fault list among the processors, because
such a partitioning requires multiple copies of the design
description and repeated simulation of the same patterns on
the fault free design. In addition, if we want to be able to
shift jobs from one processor to another, to balance the load
among the processors, the jobs should be fairly small. This
excludes coarse-grained parallelization like the partitioning
suggested in [11].

2.  Parallelizing  Fault  Simulation 

A general fault simulation algorithm has three DO loops: one
over the patterns that have to be simulated, one over the
faults in the fault-list and one over the gates in the
design. The loop over the patterns is done in the order in
which the patterns are applied to the real design, while the
loop over the faults can be done in any order. The loop over
the gates is done in topological order [12]. This is defined
even for sequential designs because when the real design is
tested it is first put in test mode [13]. In this mode, all
memory elements are controllable and observable, and the
design is reduced to a collection of disconnected pieces of
combinational logic. In each such piece, there is a trivial
partial ordering among the gates: gate A precedes gate B when
B is in the downcone of A.

The fault simulation algorithm that we will consider in this
article is (a slightly modified version of) Concurrent Fault
Simulation [14]. The overall structure of this algorithm is
shown in Table 1. The calculation at each gate determines
which faults produce fault-effects on the output of that gate
for the pattern being simulated. Implicitly, we do a loop
over all faults in the fault-list, as indicated in the
program fragment, but explicitly we only consider faults that
have fault-effects on the inputs of the gate or that are
located on the input or output pins of the gate.
DO all patterns;
   DO all gates in topological order;
      DO all faults;
         code;
      END;
   END;
END;

Table 1: DO loop structure of Concurrent Fault Simulation
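
To make the loop structure of Table 1 concrete, here is a
minimal serial Python sketch. It is not the authors' code: the
circuit encoding, the names and the restriction to stuck-at
faults on nets (rather than on individual gate pins) are
simplifying assumptions made for illustration. As in the text,
the innermost loop only visits faults that reach the gate
inputs or sit on the gate output.

```python
from typing import Dict, List, Set, Tuple

Fault = Tuple[str, int]            # (net name, stuck-at value); hypothetical encoding
Gate = Tuple[str, str, List[str]]  # (output net, operator, input nets)

OPS = {
    "AND": lambda v: int(all(v)),
    "OR": lambda v: int(any(v)),
    "NOT": lambda v: 1 - v[0],
}


def simulate_pattern(gates: List[Gate], pattern: Dict[str, int],
                     faults: Set[Fault]) -> Dict[str, Set[Fault]]:
    """Return, per net, the faults whose effect is visible on that net."""
    good = dict(pattern)
    # On a primary input a stuck-at fault is visible when it differs
    # from the applied value.
    effects: Dict[str, Set[Fault]] = {
        net: {f for f in faults if f[0] == net and f[1] != value}
        for net, value in pattern.items()
    }
    for out, op, ins in gates:                 # DO all gates in topological order
        good[out] = OPS[op]([good[i] for i in ins])
        candidates = set().union(*(effects[i] for i in ins))
        candidates |= {f for f in faults if f[0] == out}
        visible = set()
        for fault in sorted(candidates):       # DO all (relevant) faults
            if fault[0] == out:
                value = fault[1]               # output net is stuck
            else:
                flipped = [1 - good[i] if fault in effects[i] else good[i]
                           for i in ins]
                value = OPS[op](flipped)
            if value != good[out]:
                visible.add(fault)
        effects[out] = visible                 # the output fault table
    return effects


def fault_simulate(gates, patterns, faults, outputs):
    """Outer loop over the patterns; returns every fault seen on an output."""
    detected = set()
    for pattern in patterns:                   # DO all patterns
        effects = simulate_pattern(gates, pattern, faults)
        for net in outputs:
            detected |= effects[net]
    return detected


# Toy design: c = AND(a, b), d = NOT(c), observable output d.
GATES = [("c", "AND", ["a", "b"]), ("d", "NOT", ["c"])]
FAULTS = {(net, v) for net in "abcd" for v in (0, 1)}
EFFECTS = simulate_pattern(GATES, {"a": 1, "b": 1}, FAULTS)
DETECTED = fault_simulate(GATES, [{"a": 1, "b": 1}, {"a": 1, "b": 0}],
                          FAULTS, ["d"])
```

For the pattern a=1, b=1 the fault table on d contains a
stuck-at-0, b stuck-at-0, c stuck-at-0 and d stuck-at-1. The
two patterns together catch seven of the eight faults; a
stuck-at-1 would need a pattern with a=0, b=1.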

The innermost DO loop, the one over the faults, is treated as
an unbreakable, atomic unit. It is the job unit out of which
the parallel fault simulation will be built up. This
partitioning meets the various requirements mentioned in the
previous section. By partitioning the problem into disjoint
jobs, there is no duplication of design descriptions and no
duplication of calculations. In addition, the atomic jobs are
sufficiently small that they can be moved around easily when
required to maintain load balance. The output of the job is a
fault table, consisting of all the faults that produce
fault-effects on the output of the gate. The sizes of these
tables range from 1 to several thousand in the designs that
we considered. When the node where the output table is
calculated does not have enough memory to store the table, no
local memory overflow needs to occur, because the table can
be stored on another node.

Each gate needs fault tables from its preceding gates and the
corresponding job can therefore not be executed before all
jobs corresponding to preceding gates have been executed.
Gates will be called independent when they are not in each
other's downcone. If two gates are independent, neither has
to wait for the other and both can be executed in parallel.
It is this parallelism that we want to exploit in our
parallel fault simulation. A rough measure of the number of
gates that can be treated in parallel is given by the width
of the design, i.e. the ratio of the total number of gates in
the design and the average number of gates between a
controllable input, like a latch or a PI, and an observable
output, like another latch or a PO.

3.  System  Hardware  and  Software 

The parallel Fault Simulation algorithm has been implemented
on the IBM VICTOR V256 multiprocessor [15]. This is a
Transputer [16, 17] based message passing machine. V256 has
256 nodes, each consisting of a model T800 Transputer [18]
and 4 Mbyte of local memory. Communication between nodes is
done via message passing, for which the Transputer provides
substantial hardware support. Each node has four links to
neighbors and the nodes are connected in a mesh topology. The
bandwidth over the hard links between neighbors is on the
order of 1.5 Mbyte/sec. More details about the hardware can
be found in reference [15].

V256 is connected to a host, a PC/AT. The host has an
additional Transputer on which the host program runs. This
host program supervises the fault simulation: it requests
information from the user: which design to simulate, how many
patterns and which patterns, and where to store the results.
Loading the simulation programs and the design descriptions
onto the nodes of V256 is done from the host as well.

All programs are written in OCCAM [19, 20], a parallel
programming language that implements CSP [21]. All nodes run
the same program, an overview of which is shown in Figure 1.
The program is divided into two processes that run in
parallel but with different priorities (PRI PAR in OCCAM
language). The high priority process, above the dashed line,
is a router. The low priority process is below the dashed
line, and consists of four processes that run in parallel.
The actual application program consists of the FSIM and CTL
processes. MMU is a memory manager and IO is an input/output
interface between the router and the other low priority
processes. The five different main processes shown in the
figure communicate via soft channels, indicated by the dashed
arrows.

The function of the IO process is to funnel messages from the
application and the memory manager to off-node processes
through the on-node router and to distribute messages from
off-node processes between the memory manager and the
application. In addition, the IO process plays a crucial role
in gathering fault simulation and system statistics, since
all data relevant to a particular node passes through it.

The task of the memory manager is to provide several system
services, like allocating and freeing chunks of memory. The
memory manager is also used to obtain memory on other nodes
when a node has run out of space in its own local memory (see
also section 4.4).

Both on-node and off-node memory allocation requests are
handled equally by the memory manager. They share the same
resource, the memory, and use the same
allocation/deallocation algorithms. In order to reduce the
number of messages in the fault simulator, each malloc
request specifies the number of read requests (life span) for
the particular fault list, which for purely combinational
logic is known. This enables the memory manager to free the
allocated memory block automatically after receiving the
right number of read requests, without requiring an extra
free message per fault list. No attempts are made at garbage
collection in the event of an unsuccessful malloc request
since, in most cases, the performance penalty would not
justify the memory gain. A different approach is followed to
ensure better global memory utilization (section 4.4).
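
The life-span mechanism can be sketched as a small Python
class. This is an illustration under assumptions, not the
OCCAM memory manager itself: the class and method names are
hypothetical, and memory is modelled as a simple byte budget
rather than a real allocator.

```python
class MemoryManager:
    """Toy memory manager that frees a block after `life_span` reads."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.blocks = {}       # handle -> [data, size, remaining reads]
        self.next_handle = 0

    def malloc(self, data, size, life_span):
        """Store data; return a handle, or None when the request fails.

        No garbage collection is attempted on failure, as in the text.
        """
        if life_span < 1 or self.used + size > self.capacity:
            return None
        handle = self.next_handle
        self.next_handle += 1
        self.blocks[handle] = [data, size, life_span]
        self.used += size
        return handle

    def read(self, handle):
        """Return the data; free the block after the last expected read."""
        block = self.blocks[handle]
        block[2] -= 1
        if block[2] == 0:          # life span exhausted: no free message needed
            del self.blocks[handle]
            self.used -= block[1]
        return block[0]


MM = MemoryManager(capacity=100)
HANDLE = MM.malloc([3, 17, 42], size=60, life_span=2)
REJECTED = MM.malloc([1], size=50, life_span=1)   # does not fit: None
FIRST = MM.read(HANDLE)
USED_AFTER_FIRST = MM.used
SECOND = MM.read(HANDLE)                          # last read frees the block
USED_AFTER_SECOND = MM.used
```

The design choice illustrated here is that the reader count
travels with the allocation request, so a fault list that fans
out to two clusters is released by its second read and no
separate free message is sent.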


The task of the router is to route messages between different
nodes. In our application, messages can have varying sizes.
Most of them are short, 10 - 100 bytes, but some of them can
be very long. Table 2 shows some message statistics. These
statistics were collected on a 32 node partition of V256, by
fault simulating five random patterns on one of the sample
designs (C7522). With this many nodes, fault simulating one
pattern takes about 0.7 seconds, during which 1155 jobs are
executed and about 1200 faults out of 7550 are caught. The
columns show the average size of the message, the range when
the size is variable, the frequency with which the messages
are sent and the type. The latter can be n to n (node to
node), h to n (host to node) or n to n/h (node to node or
host). Each job requires the sending of many messages; on
average 2.4 request messages per job and 2.3 result messages
per job.

Figure 1: Overview of a single node process

Messages can be sent from any node to any other node. Each
message is sent as a unit and consists of a destination, a
length and the actual message. The application program is
responsible for providing the destination of the message but
is not involved in actually getting the message to its
destination. It merely sends the message to its on-node
router. The router processes on the nodes then cooperatively
route the message to its proper destination.


Message Description   Average   Range        Freq.     Freq.     Type
                      size      min-max      per job   per
                      (bytes)   (bytes)                pattern

input pattern         41.9      28 - 72      -         32        h-n
fault list request    20        -            2.4       2724      n-n
fault list            63.0      20 - 3280    2.4       2724      n-n
result message        24        -            2.3       2642      n-n/h

Table 2: Message characteristics
The main characteristics of the router are that messages are
sent along a shortest path from the sender to the destination
and that the routing is done in a strict store-and-forward
fashion. The router reads the destination and determines
whether the message has to be forwarded or sent to the
application program. If the message has to be forwarded, the
router determines the next neighbor to which to send the
message by consulting a router table. The contents of the
router tables depend on the node on which the router process
resides and on the destination. They do not depend on the
sender's address. The message is then forwarded as one unit
and stored on the neighbor, where it will be processed in a
similar fashion by its router.
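
A table-driven store-and-forward router of this kind can be
sketched in Python as follows. The sketch rests on assumptions
that are not taken from the paper: nodes are numbered row by
row on a rectangular mesh, and the tables are filled with
dimension-order (X first, then Y) shortest paths. The actual
tables also encode the deadlock-avoidance restrictions
discussed next, which are not modelled here.

```python
def build_router_table(node, width, height):
    """Map each destination to the next node on a shortest path from `node`.

    The entry depends only on the local node and the destination,
    never on the sender.
    """
    x, y = node % width, node // width
    table = {}
    for dest in range(width * height):
        dx, dy = dest % width, dest // width
        if dx != x:                            # move along X first
            nxt = (x + (1 if dx > x else -1), y)
        elif dy != y:                          # then along Y
            nxt = (x, y + (1 if dy > y else -1))
        else:                                  # message has arrived
            nxt = (x, y)
        table[dest] = nxt[1] * width + nxt[0]
    return table


def route(source, dest, tables):
    """Forward a message hop by hop; return the list of nodes visited."""
    path = [source]
    while path[-1] != dest:
        path.append(tables[path[-1]][dest])    # whole message stored at each hop
    return path


WIDTH, HEIGHT = 8, 4                           # hypothetical 32 node partition
TABLES = [build_router_table(n, WIDTH, HEIGHT) for n in range(WIDTH * HEIGHT)]
PATH = route(0, 31, TABLES)                    # corner (0,0) to corner (7,3)
```

The path from node 0 to node 31 has 10 hops, which equals the
mesh distance 7 + 3 between the two corners.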

In addition, the routing is done in a deadlock-free fashion.
The paths that a message can take are implicitly stored in
the router tables and are restricted to make the routing
deadlock-free. The deadlock avoidance algorithm that we
implemented is the 2-plane scheme described by Yantchev et
al. [22].

4.  Simulation  Software 
4.1  Preprocessing 

We parallelize fault simulation by assigning gates in the
design to the nodes in the parallel processor. In fact, we
improve the performance by combining gates in small single
output clusters and assigning clusters rather than individual
gates to the processors. The job associated with a cluster is
the calculation of the output fault table, i.e. the table of
faults that produce fault-effects on the output of the
cluster. The inputs to a job are the logic values on the
inputs of the cluster, the logic description of the cluster,
faults on the input or output pins of the cluster and
internal to the cluster, and the fault tables for the input
nets to the cluster.

How the assignment of clusters to processors is done strongly
influences the performance of the simulator. Presently, the
allocation is done randomly. This assignment of clusters is
clearly not very good from a communication point of view: the
average distance messages have to travel is of the order √n,
where n is the number of nodes in the parallel machine.
Better assignment algorithms are possible, but they have not
been implemented yet.
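
The √n behaviour can be checked numerically. The Python
sketch below assumes a square k x k mesh without wrap-around
links and source and destination chosen independently and
uniformly, which is one way to model the random assignment;
the function names are hypothetical.

```python
from itertools import product
from math import sqrt


def average_mesh_distance(k):
    """Brute-force mean hop count between two random nodes of a k x k mesh."""
    nodes = list(product(range(k), repeat=2))
    total = sum(abs(ax - bx) + abs(ay - by)
                for (ax, ay) in nodes for (bx, by) in nodes)
    return total / len(nodes) ** 2


def closed_form(k):
    """Same quantity in closed form: 2 (k^2 - 1) / (3 k), about (2/3) sqrt(n)."""
    return 2 * (k * k - 1) / (3 * k)


D_16 = average_mesh_distance(4)      # n = 16 nodes
D_256 = average_mesh_distance(16)    # n = 256 nodes
RATIO_256 = D_256 / sqrt(256)
```

Under these assumptions the average distance is 2.5 hops for
16 nodes and 10.625 hops for 256 nodes, so the ratio to √n
approaches 2/3 as the mesh grows.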


4.2  Overall  Simulation  Flow 

The simulation of the clusters that are assigned to a
processor is controlled by the controller process CTL, shown
in Figure 1. When the design is loaded into the parallel
processor, the controller process on each node builds several
data structures and initializes them. After initializing the
data structures, the controller blocks, waiting for an input
pattern message or result message. These messages indicate
that the fault table for some line that is an input to a
cluster on that node has been evaluated. The message also
gives the processor number where the fault list is located.
These messages cause an update to take place for all clusters
on that node to which the line fans out. For each cluster a
counter keeps track of the number of input stems that have
been evaluated. When all its inputs have been evaluated, a
cluster is put in a FIFO buffer to be simulated. All clusters
on that queue will be simulated locally except when dynamic
load balancing modifies this allocation (section 4.5).

Clusters are taken from the queue and sent to the actual
simulator, FSIM in the figure. When FSIM finishes its
simulation, it sends a result message to the controller,
which then updates its data structures. The controller also
forwards the result message to those nodes that own clusters
to which the output of the simulated cluster fans out. These
messages will cause further updates and the simulation
proceeds as above. The simulation of a pattern terminates
when there are no more clusters on any ready queue and no
more processors are working.

Notice that this method guarantees that the clusters are
processed in topological order even when they were not
ordered so in the controller's data structures. Once the
queue contains some clusters, the mechanism of sending jobs,
receiving result messages, putting clusters on the queue and
taking them off is all that is required to keep the
simulation going. This way, the fault simulation is cast into
a Dataflow process, with the result messages functioning as
tokens.
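
The counter-and-queue mechanism can be sketched for a single
node in Python. This is a simplified, serial illustration: the
data layout and names are hypothetical, the output line of a
cluster is assumed to carry the cluster's name, every cluster
is assumed to have at least one input, and the call to FSIM is
replaced by recording the order in which clusters would be
simulated.

```python
from collections import deque


def dataflow_order(cluster_inputs, primary_inputs):
    """Emulate the controller: fire a cluster once all its inputs have tokens.

    cluster_inputs maps a cluster name to the lines it reads.
    """
    fanout = {}                                # line -> clusters it fans out to
    for cluster, lines in cluster_inputs.items():
        for line in lines:
            fanout.setdefault(line, []).append(cluster)
    pending = {c: len(lines) for c, lines in cluster_inputs.items()}
    ready = deque()                            # FIFO buffer of ready clusters
    order = []

    def token(line):
        """Handle an input pattern or result message for `line`."""
        for cluster in fanout.get(line, []):
            pending[cluster] -= 1
            if pending[cluster] == 0:
                ready.append(cluster)

    for line in primary_inputs:                # input pattern messages
        token(line)
    while ready:
        cluster = ready.popleft()
        order.append(cluster)                  # stand-in for the FSIM job
        token(cluster)                         # result message acts as a token
    return order


# The clusters are deliberately listed out of topological order.
CLUSTERS = {
    "g3": ["g1", "g2"],
    "g1": ["a", "b"],
    "g2": ["b", "c"],
    "g4": ["g3", "a"],
}
ORDER = dataflow_order(CLUSTERS, ["a", "b", "c"])
```

Although g3 is stored first, it is only simulated after g1
and g2 have produced their tokens, which illustrates why no
explicit topological sort is needed.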

To make the controller as efficient as possible, the
data-structures it works on are tailor-made for emulating the
Dataflow process. All job descriptions are pre-stored in the
local memory of the controller and filled in as much as
possible when the design description is received by the
controller. Once a cluster is ready to be processed, the job
description can be sent to FSIM without any further
alterations.
4.3 Fault Simulation

When the fault simulation process FSIM receives a job
message, it parses it and then sends out requests for the
input fault tables that it needs. While waiting for the input
tables to return, the fault simulation process simulates the
internal cluster faults. The faults in the input fault tables
are processed in ascending order of fault numbers, so that a
sorted fault table is produced at the end. The internal
faults that were caught are merged into the final list, the
resulting fault table is stored at the node and a result
message is sent to the controller process.

4.4  Global  Memory  Utilization 

As mentioned previously, a good parallelization strategy
should properly use the total available memory in the
distributed system. This is a trivial requirement to fulfill
in the cases where memory utilization patterns are
"well-behaved", that is, completely known at compile-time.
This is typically the case in regular problems such as matrix
manipulations, image processing and finite element method
analysis. The problem becomes difficult in fault simulation
since the amount of memory required cannot be determined
a priori, and fluctuates widely from one test pattern to
another. It is possible for the memory of one processor to
overflow with malloc requests, while the memories of other
processors are under-utilized. Such nondeterministic behavior
could cause a particular run of the fault simulation to abort
while, in a global sense, there is enough memory to
accommodate the simulation. Memory compaction techniques
would temporarily alleviate the problem, at the expense of
performance, but would not solve it.

In order to achieve good global utilization of the available
memory, the following memory overflow suppression (MOS)
approach is taken. Upon an unsuccessful on-node malloc
request, the MMU issues an off-node request. The node
receiving the off-node malloc services it and returns an
acknowledge message when the malloc is successful. If the
node did not have enough memory, it forwards the original
request to another processor. The scheme proceeds until a
success is reported back to the originator of the request,
which then sends its fault table to the node where the malloc
succeeded.

The efficiency of this method rests on the algorithm used in
determining the processor to send/forward the off-node malloc
to. In the current implementation, the next processor is
determined randomly and the search for an acceptor is
terminated after a pre-fixed number of trials.
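
A serial Python sketch of this search is given below. It is an
illustration under assumptions: free memory per node is
modelled as a plain list, message passing is replaced by direct
access to that list, and the function name and parameters are
hypothetical.

```python
import random


def place_fault_table(origin, size, free, max_trials, rng):
    """Return the node that accepts the table, or None if the search stops.

    free[n] is the free memory on node n; it is reduced on success.
    """
    if free[origin] >= size:                   # on-node malloc succeeds
        free[origin] -= size
        return origin
    current = origin
    for _ in range(max_trials):                # pre-fixed number of trials
        # The request is sent or forwarded to a randomly chosen other node.
        current = rng.choice([n for n in range(len(free)) if n != current])
        if free[current] >= size:              # acknowledge goes to the originator
            free[current] -= size
            return current
    return None                                # no acceptor found


FREE = [0, 0, 100, 0]
ACCEPTOR = place_fault_table(0, 50, FREE, max_trials=1000, rng=random.Random(0))
NO_ROOM = place_fault_table(0, 50, [0, 0, 0], max_trials=20, rng=random.Random(0))
LOCAL = place_fault_table(1, 10, [0, 10], max_trials=5, rng=random.Random(0))
```

With a generous trial limit the only node with free memory
(node 2) is found; when no node has room the search gives up
after the trial limit, and when the originating node has room
no off-node request is made.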

4.5  Load  Balancing 

Static assignment of clusters to nodes runs the risk of
severe load imbalance. When a processor has no clusters on
its ready queue but has not yet processed all its clusters,
it should ask other processors, according to some protocol,
for work. This protocol can be the same as the one used to
find off-node storage. If a processor receives such a
request, and if it has enough clusters on its ready queue, it
sends a job message to the requestor. Any job message is such
that the FSIM process that handles it does not need to know
to which node this job was originally allocated. When it
finishes its job, it sends a result message back to its
controller. Only this controller knows whether the job was
originally assigned to its node or not. In the latter case it
forwards the result message to the controller on the node
where the job originated. The latter controller then handles
the result message in the usual way.

5.  Experimental  Results 

The parallel fault simulation program has been exercised on
various logic designs. We will discuss here the results for
the two largest ones. The circuit characteristics for these
two designs are given in Table 3. C7522 is the largest design
in the ISCAS suite of test generation benchmarks [23].
DESIGNA is an internal design and is almost four times larger
than C7522.


5.1 Overall performance

Table 4 shows the simulation time per pattern, measured by
simulating five random input patterns and taking their
average simulation time. The simulation time is measured from
the moment an input pattern is sent to V256 to the moment the
result message is returned. Clearly, the single pattern
simulation time for DESIGNA is not four times as much as it
is for C7522. To understand this improved performance and
also to understand the actual speedup, more detailed
statistics have to be taken.

Three different times were measured. First of all, the time
to complete a single job, i.e. the simulation of one cluster
including the requesting and receiving of required input
fault tables. These times are most conveniently measured in
the CTL process; it is the time lapse between the sending of
a job description to FSIM and the receiving of the
corresponding result message. Secondly, we want to know how
much time is spent idling, i.e. waiting for another job to
become ready. Such idling occurs when a processor still has
some clusters to process but all of them need input tables
that have not been computed yet. These idle times are again
most conveniently measured by CTL.

Finally, we want to know how much time FSIM spends waiting
between sending requests for externally stored fault tables
and receiving them. These wait times are measured by FSIM
itself. In fact, FSIM measures two distinct waiting times.
The first one is the total lapse between sending out the
messages and receiving the replies. FSIM does some work after
requests have been sent out and the time spent doing that
work should not be counted as wait time. The second lapse
time is the real wait time, i.e. the time between finishing
this additional work and receiving the fault tables.

Results for both designs are shown in Table 4. Clearly, the
main difference between the two designs is in the average job
times and the number of idle periods. The difference in
average job time results from the smaller average size of the
clusters in DESIGNA; the average time per gate is roughly the
same in both designs. More importantly, DESIGNA has
relatively fewer idle periods than C7522. This is to be
expected, because a large number of idle periods indicates a
lack of parallelism, which in larger designs is less likely
than in smaller ones. In fact, in C7522 the idle periods
account for about 25 % of the total lapse time, while in
DESIGNA they account for only 12 %.

Both idle times and wait times indicate load imbalance.
For larger designs the idle times and the
number of idle periods will decrease and therefore are
of no real concern. The influence of the waiting times
will be discussed in the next section.


                                   C7522    DESIGNA

Total simulation time (secs.)        0.7        2.0
Average job time (msecs.)           9.50       7.65
Total number of idle periods       271.2        404
Average idle period (msec.)        21.57      18.35
Maximum idle period (msec.)       194.50     202.94
Average request time (msec.)        5.83       5.42
Maximum request time (msec.)       45.95      53.63
Average wait time (msecs.)          3.44       3.47
Maximum wait time (msecs.)         40.32      46.72

Table 4: Performance statistics per pattern (32 nodes)
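
The idle percentages quoted in the text (about 25 % and 12 %) can be roughly recovered from the table entries. The sketch below is only a plausibility check; it assumes that the idle periods are summed over all 32 nodes, so that the reference time is the number of nodes times the total simulation time. That interpretation is an assumption, not something the text states.

```python
# Per-pattern values on 32 nodes, copied from Table 4.
NODES = 32
designs = {
    "C7522":   {"total_s": 0.7, "idle_periods": 271.2, "avg_idle_ms": 21.57},
    "DESIGNA": {"total_s": 2.0, "idle_periods": 404,   "avg_idle_ms": 18.35},
}

def idle_fraction(d, nodes=NODES):
    """Fraction of the aggregate processor time spent idle.

    Assumption: idle periods are counted over all nodes, so the
    reference time is nodes * total simulation time.
    """
    idle_s = d["idle_periods"] * d["avg_idle_ms"] / 1000.0
    return idle_s / (nodes * d["total_s"])

for name, d in designs.items():
    print(f"{name}: {100 * idle_fraction(d):.1f} % idle")
```

Under this assumption the two fractions come out close to the quoted figures, which suggests the table and the text are consistent.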

5.2  Performance  analysis 

The wait periods depend on how much time it
takes for fault list requests to reach their destinations
and for the requested fault lists to travel to the
requesting node. This is a fundamental problem. The
minimum time any simulator needs is governed by
the longest paths in the design. As no node on any
such longest path can be simulated before the previous
node on that path has been simulated, no
parallelization can reduce the time needed to simulate
the nodes on such a path. With the random assignment
of gates to processors employed here, request
times are proportional to d(P), the average distance
in a mesh of P nodes, and the minimum time needed
to do the fault simulation will therefore have a term
proportional to d(P) as well.

The total time T taken by the fault simulation
depends on the number of processors P, the number
of clusters C and the allocation of the clusters to the
processors. In practice, C is roughly one third of the
number of gates in the design. The time needed for
the simulation of one cluster is roughly the sum of the
time needed to obtain the fault tables from other
nodes and the time needed for processing these fault
tables. When P becomes large, the first term will
dominate and we will focus on its effects.




We therefore find:

\[ T \propto \frac{d(P)}{P}\, C S, \qquad (1) \]

where we assumed that there is perfect load balance.
The total time taken on one processor is also proportional
to C and S, and the resulting speedup is
therefore proportional to P/d(P). With the random
allocation on a rectangular mesh, d(P) is roughly
equal to $2\sqrt{P}/3$, and the speedup is proportional to
$\sqrt{P}$.
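
As a quick numerical illustration of equation (1), the sketch below evaluates the mesh distance approximation d(P) = 2*sqrt(P)/3 and the resulting speedup factor P/d(P). The function names are hypothetical, and the constant of proportionality is ignored, so only the scaling with P is meaningful.

```python
import math

def avg_distance(p):
    """Approximate average distance d(P) for random allocation on a
    rectangular mesh of p nodes, using d(P) ~ 2*sqrt(P)/3."""
    return 2.0 * math.sqrt(p) / 3.0

def model_speedup(p):
    """Speedup predicted by the model, up to a constant: P / d(P)."""
    return p / avg_distance(p)

for p in (32, 64, 128, 256):
    print(p, round(avg_distance(p), 2), round(model_speedup(p), 2))
```

For P = 32 this gives d(P) slightly below 4 and P/d(P) slightly above 8. Quadrupling P doubles the predicted speedup, which is the square-root behaviour described in the text.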


The most interesting application of parallel
processors to fault simulation occurs when we let P
grow linearly with C. This is a very natural thing to do,
because, when C increases, the amount of memory
required to hold the design description has to increase
as well. This is true even for uni-processors. In a
parallel processor, a node typically has a fixed amount
of memory and the easiest way to increase the total
memory is therefore to increase the number of nodes
P. When P is proportional to C, the total time taken
for the fault simulation grows only as $S C^{1/2}$ rather
than as SC when done on a uni-processor. Note also
that on topologies with d(P) ~ ln P the total simulation
time grows only as S ln C.

Figure 2: Relative performance


Speedup data are plotted in figure 2 as a function
of P/d(P). The figure shows that for a range of P
values, the speedup behaves roughly as P/d(P), as was
found in the previous analysis. For small P, the
speedup is not proportional to P/d(P) because the
analysis was only correct for large P. For very large
P, perfect load balancing cannot be maintained, because
of the finite degree of parallelism in the design.
However, as shown by the figure, larger designs maintain
their parallelism longer than smaller ones.


5.4 Scaled speedup


5.3  Speedup 

Figure 2 shows the speedup as a function of the
number of processors. The speedup is calculated as
follows. First, the total simulation time as seen from
the host is obtained. The speedup is then measured
by dividing the total simulation time at some fixed
number of processors by the simulation time at the
actual number of processors. Because these designs
are too large to run on a single processor, the speedup
with respect to the single node parallel processor
could not be measured. Instead, the speedup is calculated
relative to 32 nodes and the speedup at 32
nodes is set to 8. This arbitrary speedup was obtained
by calculating P/d(P) (see the section on the performance
analysis) and using for d(P) the value for
random allocation on a rectangular mesh (=4).


Finally, we would like to measure the actual
speedup that is feasible with this parallelization. The
maximum speedup shown when all 256 nodes are
used is not realistic because of the severe underutilization
of most of the nodes. We therefore consider
the simulation times at the number of processors
where the speedup curve starts to flatten out (32 for
C7522 and 128 for DESIGNA). We could not run
these designs on a single node, but by subtracting the
wait times from the average job times and then
multiplying the result by the number of jobs, we can
estimate how long the fault simulation would take on
a uni-processor. For C7522 we find about 7 seconds
and for DESIGNA about 22.8 seconds. This leads
to an estimated real speedup of 10 for C7522 and 22
for DESIGNA.

Abstract

This paper explores the methods used to parallelize
NP-complete problems and the degree of improvement
that can be realized using a distributed parallel processor
(hypercube) to solve these combinatoric problems.
Common characteristics of NP-complete problems are
identified and the set covering problem (SCP) is chosen
as the vehicle for exploration. The SCP has application
in many AI, communications, computer science,
and control problems and has been extensively studied
in the serial case, but a parallel implementation has
not been reported. The design process states the basic
algorithms in terms of UNITY metaprograms and
iteratively develops three increasingly complex parallel
versions of the SCP: coarse grain/static allocation,
fine grain/dynamic allocation, and a dynamic load balancing
version. A speedup is obtained in each of five
test inputs, with super-linear speedup obtained in four
of the five tests.


Introduction 

This paper explores the methods used to parallelize
NP-complete problems and the degree of improvement
that can be realized using a distributed parallel processor
to solve these combinatoric problems. Many problems
in AI, communications, computer science, control,
and VLSI involve problems that reflect, in the
worst case, an enumeration of all possible paths to a
solution; that is, a combinatoric explosion whose associated
solution time characteristic is bounded by
an exponential function. General examples include
the set covering problem (SCP), the assignment problem,
and the traveling salesman problem (TSP). Serial
solutions to these problems are well known and documented
for specific cases as well as for the general
case [2, 5, 8, 13]. Specific parallel implementations of
the assignment problem and the traveling salesman
problem have been reported [7, 10, 20, 11].

In the following sections, a brief background is presented,
followed by a discussion of the parallel solution
techniques. The SCP is explained and three parallel
SCP algorithms are discussed. The final sections
present the performance of the parallel SCP programs
and the utility of the parallel programs.

NP-Complete  Problems 

“It is an unexplained phenomenon that for many of
the problems we know and study, the best algorithms
for their solution have computing times which cluster
into two groups” [15]. The solution time for the
first group of problems is bounded by a polynomial-time
function. Examples include sorting, O(n log n);
binary searching, O(log n); and matrix multiplication.
The second group of problems are those
whose best known algorithms are nonpolynomial.
Examples include the TSP, O(n^2 2^n); the 0/1 knapsack
problem, O(2^{n/2}); and the SCP, O(2^n) [15]. The focus
of this paper is a collection of problems in the second
class, termed nondeterministic polynomially-complete
(NP-complete).

All NP-complete problems have two distinguishing
characteristics. First, an NP-complete problem must
be in the class NP. Secondly, any NP-complete problem
must be transformable to all other NP-complete
problems in polynomial time and vice-versa [2].

Many NP-complete problems exhibit common characteristics
which can be exploited or have an impact
on a parallel implementation. Such characteristics include
polynomial-time a priori reductions which potentially
reduce the input problem, graph search, collection
and use of partial state information (deterministic/estimation),
and the unpredictable nature in which
an NP-complete search progresses. In many instances,
the a priori reductions are matrix manipulation operations,
and experience by other authors [6, 12, 16, 21]
has shown that numerous matrix operations can be
parallelized. Hence, it is reasonable to assume that
the a priori reductions can be efficiently implemented
on a parallel computer. NP-complete search methods,
in general, utilize partial state information in conjunction
with a bounding function or lower bound test to
improve the efficiency of the search. The availability
of selected partial state information obtained in other
processors could potentially increase the efficiency of
such bounding functions. Finally, the unpredictable
nature of the search makes a coarse grain data partitioning
algorithm inefficient for most problems; therefore,
some method of dynamic load balancing is usually
necessary to distribute sections of the search tree
to idle processors.

Studying the parallelization of NP-complete problems
requires the selection of a representative problem
which is proven NP-complete. The SCP was chosen
for this research because many applications such as
graph coloring, information retrieval, optimal resource
scheduling, AI, circuit simulation, operations research,
assignment problems, and VLSI logic expression simplification
can be structured as an SCP. In
addition, its generic NP-complete common characteristics
are well documented [8] and a parallel implementation
has not been reported.

Solution Techniques

Parallel programming design techniques involve decomposing
the problem and developing the parallel algorithms.
The major components of a parallel solution
are developed in a four phase process. In the
first phase, a meta-level design is accomplished using
an appropriate design language such as UNITY (Unbounded
Nondeterministic Iterative Transformations).
UNITY is a design syntax developed by Chandy and
Misra [6] for use in developing parallel programs. It is
their attempt to incorporate a formal syntax into the
parallel program design process and is similar, in many
respects, to Hoare's [14] method of designing communicating
sequential processes. Both methods are based
on predicate calculus, temporal logic, and structured
design methods.

The second and third phases of the design iteratively
transform the UNITY metaprograms into more
complex UNITY representations of the problem until
the UNITY metaprograms are sufficiently developed
to map directly to a target architecture. The algorithms
for this research are implemented on an Intel
iPSC/2 hypercube; hence, the UNITY design is
mapped to a cube-connected architecture. The final
design phase is to conduct a complexity analysis of
the algorithms.

Preprocessing 

In many instances, the efficiency of the search techniques
may be improved through the use of precomputation
or preconditioning in the form of a priori reductions
and selective bounding functions [5].

In many problems, it is possible to reduce the
amount of searching required with problem specific reduction
techniques or precomputation to reduce the
dimensions of the original graph or tree [5]. One such
reduction is to remove any states which are included in
every branch of the search tree. For example, consider
a tree search in which every branch contains the same
node, say node 1. Since node 1 is contained in every
solution to the search, it is not necessary to include
this node in every search path. Rather, the node is removed
from the input problem and retained for
later insertion into the final solution.
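
For the set covering setting, one common instance of this kind of reduction is the removal of "essential" columns: if some row is covered by exactly one column, that column appears in every cover. The sketch below illustrates the idea only; the paper does not say which two reductions it implements, and the data layout (`rows_of` mapping each column to the set of rows it covers) and function name are assumptions made for this example.

```python
def remove_essential_columns(rows_of, num_rows):
    """Pull out columns that every cover must contain.

    rows_of: dict mapping column id -> set of row ids it covers.
    Returns (forced, remaining, uncovered): the forced columns, the
    reduced column map, and the rows that still need covering.
    """
    forced = set()
    for r in range(num_rows):
        covering = [c for c, rows in rows_of.items() if r in rows]
        if len(covering) == 1:          # only one way to cover row r
            forced.add(covering[0])
    covered = set().union(*(rows_of[c] for c in forced))
    remaining = {c: rows - covered
                 for c, rows in rows_of.items() if c not in forced}
    uncovered = set(range(num_rows)) - covered
    return forced, remaining, uncovered

# Hypothetical 4-row instance: rows 0 and 3 each have a single covering column.
forced, remaining, uncovered = remove_essential_columns(
    {0: {0, 1}, 1: {1, 2}, 2: {2, 3}}, num_rows=4)
print(forced, remaining, uncovered)
```

The forced columns are set aside and added back to the final solution, as the paragraph above describes for node 1.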

Dominance testing is a precomputation method
which may decrease the size of the search tree by comparing
the current state of the search to previously
saved states. For instance, if the current state is a
subset of a previous state and the current state's cost
is greater than or equal to the previous state's, then
the algorithm can backtrack. This technique requires
that a list of previous states be maintained in some suitably
arranged manner to allow an efficient comparison
to the current state. If desired, all previous states
may be saved, in which case this approach resembles
a breadth-first search of the problem space. As
with most engineering problems, some tradeoff must
be made between the number of stored previous states
and the computation time required to check for dominance [9].
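
The dominance rule can be written as a short predicate. This is a minimal sketch under the assumption that a search state is represented by the set of rows covered so far together with its accumulated cost; the representation and the name `is_dominated` are illustrative, not taken from the paper.

```python
def is_dominated(covered, cost, saved_states):
    """Return True if some saved state covers at least the same rows
    at no greater cost, in which case the search can backtrack.

    covered: frozenset of rows covered by the current state.
    saved_states: iterable of (frozenset_of_rows, cost) pairs.
    """
    return any(covered <= prev_rows and cost >= prev_cost
               for prev_rows, prev_cost in saved_states)

saved = [(frozenset({0, 1, 2}), 5)]
print(is_dominated(frozenset({0, 1}), 6, saved))   # subset, costlier
print(is_dominated(frozenset({0, 1}), 4, saved))   # subset, but cheaper
```

A linear scan of the saved states is the simplest arrangement; the tradeoff mentioned above is between how many states are kept and how long each such scan takes.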

In addition to dominance testing, the computation
of a lower bound is sometimes useful in bounding the
search. A lower bound is the lowest possible cost down
a branch of the search tree. Whether or not the lower
bound can actually be obtained is irrelevant. What
does matter is that if the current cost plus the lower
bound exceeds the best cost obtained thus far, the algorithm
backtracks. As the computation of the lower
bound becomes more accurate, more branches of the
search tree are pruned. In the best case, the lower
bound is exact and the search proceeds down the optimum
path without backtracking. It is natural to
assume that the precision of the lower bound computation
increases with the amount of computation
time spent on computing the lower bound.
That is, a precise lower bound may require a long time
to compute. Therefore, a suitable lower bound computation
is one in which the time required to compute
the bound does not adversely impact the overall search
time [5, 9].
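
A cheap lower bound and the corresponding backtrack test might look as follows. The bound used here (the largest, over the uncovered rows, of the cheapest column covering that row) is an assumption chosen because it is easy to compute and never overestimates; the paper does not specify its bound. The names `rows_of` and `cost` are hypothetical.

```python
def lower_bound(uncovered, rows_of, cost):
    """Cheap lower bound on the cost still needed to cover `uncovered`.

    Every uncovered row needs at least one column that covers it, so
    the cheapest such column for any single row is a valid bound; the
    maximum over rows is the tightest bound of this form.
    """
    bound = 0
    for r in uncovered:
        cheapest = min((cost[c] for c, rows in rows_of.items() if r in rows),
                       default=float("inf"))   # inf: row cannot be covered
        bound = max(bound, cheapest)
    return bound

def should_backtrack(current_cost, bound, best_cost):
    """Backtrack when current cost plus the bound exceeds the best cost."""
    return current_cost + bound > best_cost

rows_of = {0: {0, 1}, 1: {1, 2}, 2: {2}}
cost = {0: 3, 1: 4, 2: 2}
b = lower_bound({0, 2}, rows_of, cost)
print(b, should_backtrack(5, b, 7), should_backtrack(4, b, 7))
```

A tighter bound would prune more branches but cost more per node, which is the tradeoff described in the paragraph above.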

Load  Balancing 

As previously stated, NP-complete problems are inhomogeneous;
therefore, it is difficult to balance the
workload between autonomous processors. The incorporation
of a global best cost which is known by all
searching processors further increases the likelihood
of a load imbalance, as noted by Lai and Sahni [17].
Hence, load balancing is an integral component in
any parallel implementation of an NP-complete search.
The load balancing may take the form of static or
dynamic allocation of subgraphs or it may be in the
form of a dynamic load balancing scheme. In either
instance, the necessity to balance the load between
processors is well documented [24, 20, 11, 19, 23, 18].


The  Set  Covering  Problem 

The set covering problem (SCP) is one of a large
class of NP-complete problems [2]. It was extensively
studied in the late 1960's and early 1970's in connection
with operational research problems: airline
and assembly line scheduling, design of computer
systems, crew scheduling, and political districting are
all types of problems which can be formulated as an
SCP [9, 24, 3]. The SCP is the problem of finding
the minimum number of columns in a 0-1 matrix such
that all rows of the matrix are covered by at least one
element from one of the chosen columns and the cost
associated with the covering columns is optimal (minimum
or maximum) [8]. A 0-1 matrix is a rectangular matrix in
which a covered row is denoted by a '1' in the covering
columns. If the rows in the matrix represent the vertices
of a graph, the existence of an arc between any
two vertices is denoted by a '1' in the column of the
matrix. A worst case search requires that all combinations
of the various sets be checked. This number of
combinations is the size of the power set, or 2^n. As an example,
Figure 1 shows a 0-1 matrix in which the rows are
covered by several different combinations of columns.
The worst case search would be O(2^n) for its n columns.
Columns 0, 1, 2, 3, and 4 form a cover with a total cost
of 27. The optimal cover is formed by columns 0, 3, and 4
with a cost of 15.

Figure 1: 0-1 Matrix [8]


A branch-and-bound search for a set cover attempts
to minimize the number of set combinations tested in
the search tree. It could be argued that all optimal
search techniques are elaborate bookkeeping exercises.
The branch-and-bound algorithm must store the traversed
states so they can be recalled during the backtracking
phase. Furthermore, it is desirable to choose
only those sets which actually contribute to the solution.
For instance, in Figure 1, suppose the search
algorithm has chosen sets {0, 1} to cover rows {0, 1,
3, 5}. It is pointless to choose set {2} since it will
not cover any rows not already covered by sets {0, 1}.
Therefore, the efficiency of the search process is improved
if there exists some method to choose the next
set that covers rows not already covered.
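
For contrast, the exhaustive search that branch-and-bound tries to avoid can be sketched in a few lines. The instance below is hypothetical (the matrix of Figure 1 is not reproduced here), and `rows_of`, `cost`, and the function name are illustrative assumptions.

```python
from itertools import combinations

def brute_force_cover(rows_of, cost, num_rows):
    """Test every non-empty combination of columns (2^n - 1 of them)
    and return (best_cost, best_columns) for the cheapest cover."""
    all_rows = set(range(num_rows))
    best_cost, best_cols = float("inf"), None
    cols = sorted(rows_of)
    for k in range(1, len(cols) + 1):
        for subset in combinations(cols, k):
            covered = set().union(*(rows_of[c] for c in subset))
            total = sum(cost[c] for c in subset)
            if covered == all_rows and total < best_cost:
                best_cost, best_cols = total, subset
    return best_cost, best_cols

# Hypothetical instance: 4 rows, 4 columns.
rows_of = {0: {0, 1}, 1: {1, 2}, 2: {2, 3}, 3: {0, 3}}
cost = {0: 3, 1: 5, 2: 4, 3: 6}
print(brute_force_cover(rows_of, cost, num_rows=4))
```

Every combination is examined regardless of cost, including those that add a column covering nothing new, which is exactly the wasted work the paragraph above describes.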

Christofides [8, 9] suggests the construction of a table
to assist in the bookkeeping and selection of the
next set. The construction of the SCP table essentially
preorders the rows and columns of the input matrix,
which guides the search in an efficient manner. The
result is a variation of a best-first search without the
requirement to maintain a priority queue (“open list”).

The algorithm to build the table defines a block for
each row of the matrix. All columns covering a particular
row are contained in the block for that row. A
search algorithm which selects one column from each
block is guaranteed to cover all the rows. If, in addition
to just selecting columns from the blocks, the search
algorithm keeps track of the rows already covered, the
algorithm could skip blocks which correspond to rows
already covered. The search progresses from left to
right in the table, continually selecting and marking
one column from each block as necessary. If the algorithm
must backtrack, it regresses from right to left
until it has found a block that can be further expanded.
Notice, also, that the columns within each block are
ordered in ascending order on the cost. This ordering,
in most cases, decreases the number of expanded nodes
in the search tree. The worst case, of course, requires
that all columns be checked before the optimal solution
is found. As stated, the purpose of the SCP table
is to assist the search in the bookkeeping and selection
of the next column.
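
The table construction and the left-to-right search with backtracking can be sketched as follows. This is an illustrative serial version, not the authors' program: it omits the row preordering, the dominance test, and the lower bound, and prunes only against the best cost found so far. The names `build_table`, `search`, `rows_of`, and `cost` are assumptions made for the example.

```python
def build_table(rows_of, cost, num_rows):
    """One block per row; each block lists the columns covering that
    row, in ascending order of cost (ties broken by column id)."""
    return [sorted((c for c, rows in rows_of.items() if r in rows),
                   key=lambda c: (cost[c], c))
            for r in range(num_rows)]

def search(table, rows_of, cost):
    """Depth-first search over the blocks, skipping covered rows."""
    best = {"cost": float("inf"), "cols": None}

    def visit(block, covered, chosen, total):
        # Skip blocks whose row is already covered.
        while block < len(table) and block in covered:
            block += 1
        if block == len(table):                 # every row is covered
            if total < best["cost"]:
                best["cost"], best["cols"] = total, tuple(chosen)
            return
        for c in table[block]:
            if total + cost[c] >= best["cost"]:
                break       # block is cost-ordered: later columns are no better
            visit(block + 1, covered | rows_of[c], chosen + [c],
                  total + cost[c])
        # Falling out of the loop is the backtrack to the previous block.

    visit(0, frozenset(), [], 0)
    return best["cost"], best["cols"]

rows_of = {0: {0, 1}, 1: {1, 2}, 2: {2, 3}, 3: {0, 3}}
cost = {0: 3, 1: 5, 2: 4, 3: 6}
table = build_table(rows_of, cost, num_rows=4)
print(table)
print(search(table, rows_of, cost))
```

Because each block is sorted by cost, the loop over a block can stop at the first column that cannot improve on the best cost, which is one way the ordering reduces the number of expanded nodes.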

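The SCP may be defined as follows [8]:

Given a set $R = \{r_1, r_2, \ldots, r_m\}$ and a
family $\mathcal{S} = \{S_1, S_2, \ldots, S_N\}$ of sets such
that $S_j \subset R$, any subfamily
$\mathcal{S}' = \{S_{j_1}, S_{j_2}, \ldots, S_{j_k}\}$ of $\mathcal{S}$ such that

\[ \bigcup_{i=1}^{k} S_{j_i} = R \qquad (1) \]

is called a set covering of R.

Given a 0-1 matrix $[a_{ij}]$, where $a_{ij} = 1$ if column j covers
row i, $c_j$ is the cost of column j, and $x_j \in \{0, 1\}$ indicates
whether column j is selected:

Minimize:

\[ \sum_{j=1}^{N} c_j x_j \]

Subject to:

\[ \sum_{j=1}^{N} a_{ij} x_j \ge 1, \qquad i = 1, 2, \ldots, m \]

That is, minimize the cost such that all the elements
of R are covered by at least one set from $\mathcal{S}'$.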

Since the parallel SCP programs are derived in part
from serial SCP programs, a serial SCP UNITY program
is developed first and then transformed to a
parallel UNITY program. The first parallel UNITY
program is not specific to any architecture; hence, an
additional iteration is performed to develop an architecture
specific UNITY program. Since the SCP is
implemented on an iPSC/2 cube-connected computer,
the UNITY program is designed to take advantage of
the distributed nature of a cube-connected computer.
The UNITY programs are quite extensive and are not
presented here; rather, they may be found in Beard [4].
Following the development of each UNITY program,
an invariant, a fixed point, and a progress condition
are derived and employed to prove the correctness of
the UNITY program. One of the strengths of UNITY
is the ability to build on previous proofs; hence, at each
iteration, the new program is proven correct based on
the proof of the previous program.

Parallel  Algorithms 

For the major SCP process components, the UNITY
design and subsequent translation are mapped to appropriate
algorithms. These components include two
a priori reductions, a bitonic merge sort, and a multifaceted
search technique. The a priori reductions
are divide-and-conquer algorithms using a logarithmic
collection technique, the bitonic merge sort is a
generic implementation of the algorithm presented by
Quinn [21], and the parallel search for the optimal
set cover is an extension of the branch-and-bound algorithm
presented by Christofides [8]. The parallel
search utilizes a dominance test, a lower bound test,
and a global best cost maintained at a central location
for distribution to all processors.

As with the development of the UNITY programs,
a serial SCP algorithm is developed first, followed by
the development of three increasingly complex parallel
algorithms. Each new version of the parallel SCP is
based on the previous version and is aimed at reducing
individual processor idle-time. The first parallel implementation
of the SCP employs a coarse grain algorithm
with static allocation of the search space. The second
version is a fine grain algorithm with dynamic allocation
of the search space, and the third version is a fine
grain algorithm with the addition of a dynamic load
balancing technique in which the searching processors
nondeterministically share portions of the search tree.

All three parallel algorithms employ a common control
structure. Processor 0 is reserved as a controller,
with the remaining processors executing only
the search algorithm. The controller receives the input
matrix from the host processor and, following any user
requested reductions, it coordinates a parallel bitonic
merge sort of the rows and columns and sends the data
to the searching processors. Depending on the particular
algorithm, as discussed below, it may or may
not partition the input state space. In either case,
it functions as the central repository for the globally
maintained best cost and the corresponding list of covering
sets. As the searching processors search their
respective search trees, they compare their local best
cost against their copy of the global best cost. If a
searcher finds a better solution than its copy of the
current global best solution, it submits the solution
to the controller. The controller compares all received
costs against its current global cost and retains the better.
If a new global best cost is received, this cost is
broadcast to all searching processors for use in bounding
their search trees. The following sections describe
the three parallel algorithms.
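
The controller's handling of submitted solutions can be modelled in a few lines. The sketch below is a single-process stand-in: the `broadcasts` list takes the place of real message passing on the hypercube, and the class and method names are hypothetical.

```python
class Controller:
    """Central repository for the global best cost and its cover."""

    def __init__(self, num_searchers):
        self.num_searchers = num_searchers
        self.best_cost = float("inf")
        self.best_cover = None
        self.broadcasts = []        # (searcher_id, cost) messages "sent"

    def submit(self, cost, cover):
        """Handle a solution submitted by a searcher.

        Keeps the better of the submitted and current global cost and,
        if the global best changed, broadcasts it to every searcher.
        Returns True when the submission became the new global best.
        """
        if cost >= self.best_cost:
            return False            # a stale local copy caused the submission
        self.best_cost, self.best_cover = cost, cover
        for searcher in range(1, self.num_searchers + 1):
            self.broadcasts.append((searcher, cost))
        return True

ctl = Controller(num_searchers=2)
print(ctl.submit(27, (0, 1, 2, 3, 4)), ctl.submit(15, (0, 3, 4)),
      ctl.submit(27, (0, 1, 2, 3, 4)), ctl.best_cost)
```

The third submission is rejected because searchers compare only against their local copy of the global best, which may lag behind the controller's value; the controller's own comparison filters such cases out.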

Coarse  Grain/Static  Allocation 

The coarse grain algorithm is the simplest of the
three. Once the searching processors receive the sorted
input matrix, each processor builds a copy of the SCP
table and expands the state space with the results of
the expansion stored in a queue. The expansion algorithm
is a simple breadth-first expansion of the state
space which divides the search tree by first inserting
the level 1 nodes into a queue. If necessary, the expansion
algorithm continues to expand the tree by removing
the top entry from the queue and expanding
it to the next level. The expansion is complete when
the number of subtrees is greater than the number of
searching processors or a preset number of subtrees
exist.
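
The breadth-first expansion and its stopping rule can be sketched as below. The tree is abstracted behind a hypothetical `children_of` callback, and nodes that cannot be expanded further are kept as single-node subtrees; both choices are assumptions made for this illustration.

```python
from collections import deque

def expand(level1, children_of, num_searchers, max_subtrees):
    """Breadth-first expansion of the search tree into subtree roots.

    Stops once there are more subtrees than searchers or a preset
    number of subtrees exists (or nothing is left to expand).
    """
    queue = deque(level1)           # start with the level 1 nodes
    leaves = []                     # nodes with no children to expand

    def count():
        return len(queue) + len(leaves)

    while queue and count() <= num_searchers and count() < max_subtrees:
        node = queue.popleft()      # remove the top entry ...
        kids = children_of(node)
        if kids:
            queue.extend(kids)      # ... and expand it to the next level
        else:
            leaves.append(node)
    return leaves + list(queue)

# Hypothetical binary tree of depth 3; a node is the path from the root.
def binary_children(node):
    return [node + (0,), node + (1,)] if len(node) < 3 else []

print(expand([(0,), (1,)], binary_children, num_searchers=3, max_subtrees=8))
```

With three searchers the two level 1 nodes are not enough, so both are expanded once and the expansion stops with four subtrees.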

As stated, the same expansion algorithm is executed
by all searching processors; therefore, the queue on
each searcher is identical. This duplication is not necessarily
globally efficient but the algorithm is simple to
develop and duplicate on all processors. Following the
initial expansion of the search tree, the searching processors
remove subtrees from the queue based on their
processor ID. For example, given a queue with three
entries and two searching processors, processor 1 removes
the first item from the queue and processor 2
removes the second item from the queue. When finished
with the current subtree, processor 1 removes
the third subtree from the queue. Processor 2 sits idle
following the completion of its search.
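
One allocation rule consistent with this example is a fixed stride through the shared queue: searcher p takes entries p, p + n, p + 2n, and so on, where n is the number of searchers. The paper gives only the example, so the stride rule and the name `static_share` are assumptions.

```python
def static_share(queue, pid, num_searchers):
    """Subtrees taken by searcher `pid` (numbered from 1): every
    num_searchers-th entry of the identical local queue."""
    return queue[pid - 1::num_searchers]

queue = ["subtree A", "subtree B", "subtree C"]
print(static_share(queue, 1, 2))
print(static_share(queue, 2, 2))
```

Because every searcher holds the same queue and applies the same rule, no communication is needed to decide who searches what, but nothing can be done when one searcher's share turns out to be much larger than another's.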

In this load balancing scheme, the initial search tree
is divided a predetermined number of times and all
searching processors may or may not receive a subtree
to search. The subtrees are statically allocated to the
searching processors since the allocation of subtrees to
processors is determined in the algorithm (i.e., 'hard-coded').

As one might suspect, much time is wasted by idle
processors and the workload is far from balanced for
most problem instances. Therefore, the coarse grain
algorithm is modified to decrease the processor idle-time.

Fine  Grain/Dynamic  Allocation 

The initial expansion for the fine grain algorithm
is quite similar to the breadth-first expansion for the
coarse grain algorithm. One major difference is worth
noting. In an attempt to converge on the optimal
solution more quickly, the new expansion algorithm combines
both a breadth-first and a depth-first expansion.
Given the matrix preordering and the construction of
the SCP table previously explained, it is likely that
the optimal solution to the SCP lies in the left-most
portion of the search tree. Therefore, at each level in
the search tree, the expansion algorithm accomplishes
a breadth-first expansion on the left-most node in the
search tree.

The dynamic allocation portion of the search attempts
to decrease overall processor idle-time. The
expansion algorithm is moved from the searching processors
to a controlling processor (supervisor). The
subtrees are constructed and the controller assigns
each processor an initial subtree. As processors finish,
they are assigned another subtree from the queue until
the queue is empty, at which point increasing
numbers of processors become idle during the wind-up
phase of the search. Intuitively, this search algorithm
should perform better than the coarse grain algorithm
with static allocation since processor idle-time will decrease.
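
The intuition can be checked with a small simulation that compares the finishing time of static and dynamic allocation for a given list of subtree search times. The work values are hypothetical, the static rule is the stride allocation assumed for the coarse grain example, and communication costs are ignored.

```python
import heapq

def static_makespan(work, n):
    """Finish time when searcher p statically takes entries p, p+n, ..."""
    return max(sum(work[p::n]) for p in range(n))

def dynamic_makespan(work, n):
    """Finish time when the controller hands the next subtree in the
    queue to whichever searcher becomes free first."""
    free_at = [0] * n               # time at which each searcher is free
    heapq.heapify(free_at)
    for w in work:
        t = heapq.heappop(free_at)  # earliest available searcher
        heapq.heappush(free_at, t + w)
    return max(free_at)

work = [5, 1, 1, 1, 1, 1]           # hypothetical search times per subtree
print(static_makespan(work, 2), dynamic_makespan(work, 2))
```

In this example one subtree dominates the work; static allocation also gives that searcher two further subtrees, while dynamic allocation lets the other searcher absorb all the small ones.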

Dynamic  Load  Balancing 

The dynamic load balancing version of the SCP begins
as a fine grain parallel algorithm and enters a dynamic
load balancing process when all subgraphs have
been distributed. Upon completion of the fine grain
distribution of subgraphs, the controller triggers the
active participation of a separate process called the token
process. The token process exists on all processors
and its only function is to coordinate the dynamic load
balancing scheme. A token is circulated in a ring through
all nodes in the cube and is composed of a linear array
of m integers, where m is the number of processors in
the user acquired cube. The first integer in the token
(Token[0]) denotes the number of searchers still
searching and is included to facilitate quick checking
of the status of the search. The remaining integers in
the array are used by the searchers to indicate whether
they are working or idle.

The token process located on the controller (i.e.,
processor 0) initially examines the first element in the
token to determine if any searchers are still searching.
If all searchers are waiting for another subgraph
(Token[0] = 0), the search is complete and the token process
notifies all searching processors. If any searcher
is still working (Token[0] ≠ 0), the token is passed unchanged
to the next node in the ring.

Each token process residing on the searching
processors continually monitors its receive buffer for the
presence of the token, a request from another processor,
or a message stating that the current processor
is idle. If the token arrives while the processor is
idle, the token is updated and the token process cycles
down the linear array until it finds a processor still
searching or it has polled all active processors.
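
The token bookkeeping described above might be sketched as follows. The WORKING/IDLE encoding, the function names, and the simple front-to-back scan order are assumptions made for illustration; the text only specifies that Token[0] counts active searchers and that the remaining entries record each searcher's state.

```python
WORKING, IDLE = 1, 0                     # assumed encoding of the per-searcher entries


def make_token(m):
    """Token for an m-processor cube: processor 0 is the controller, 1..m-1 search."""
    token = [WORKING] * m
    token[0] = m - 1                     # Token[0] = number of searchers still searching
    return token


def search_complete(token):
    """Check made by the controller's token process."""
    return token[0] == 0


def mark_idle(token, rank):
    """Update made by an idle searcher when the token arrives."""
    if token[rank] == WORKING:
        token[rank] = IDLE
        token[0] -= 1


def find_donor(token, rank):
    """Cycle down the array for a searcher that is still working; None if there is none."""
    for other in range(1, len(token)):
        if other != rank and token[other] == WORKING:
            return other
    return None


token = make_token(4)
mark_idle(token, 2)
print(token, find_donor(token, 2))       # [2, 1, 0, 1] 1
```

Keeping the count in Token[0] lets the controller test for termination without scanning the whole array.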

If  the  token  process  finds  a  working  processor,  it  re¬ 
quests  a  subgraph  from  that  processor.  The  requesting 
token process communicates with the token process
located on the active processor. The searching processes
are never allowed to communicate with each other. All
dynamic  load  balancing  is  coordinated  through  the  to¬ 
ken  process.  The  result  is  an  efficient,  simple,  and 
highly  reusable  dynamic  load  balancing  scheme. 

The  token  processes  coordinate/control  the  dynamic 
load  balancing  process;  however,  the  searching  process 
is responsible for partitioning the search tree for sharing.
Two problems must be addressed. First, the sharing
algorithm must ensure completion of the search. In
other  words,  a  race  condition  must  not  develop  where 
the  same  subtree  is  continually  passed  between  pro¬ 
cessors.  Second,  it  is  desirable  to  limit  the  amount  of 
unnecessary  sharing. 

The  race  condition  is  easily  prevented.  A  search¬ 
ing  process  is  only  allowed  to  share  after  it  has  back¬ 
tracked.  This  requirement  forces  the  searching  process 
to  expand  at  least  one  branch  in  the  search  tree  and, 
since  any  shared  subtree  consists  of  unexpanded  nodes, 
a race condition cannot occur.

To  limit  the  amount  of  sharing,  the  algorithm  which 
partitions  the  subtrees  is  designed  to  partition  the 
largest  possible  subtree.  When  a  subtree  is  requested, 
an  active  searching  process  backtracks  through  the 
search  tree  until  it  finds  the  highest  expandable  node. 
The  beginning  of  this  new  subtree  is  marked  so  that  it 
will  not  be  searched  by  the  current  processor  and  the 
(largest  expandable)  subtree  is  given  to  the  token  pro¬ 
cess  which  transmits  it  to  the  requesting  token  process. 
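
A minimal sketch of the two sharing rules follows. It assumes the searcher keeps, for each depth of its current path, a list of sibling nodes it has not yet expanded; that data layout and the names `pending` and `donate_subtree` are hypothetical and are not taken from the implementation described here.

```python
def donate_subtree(pending, has_backtracked):
    """Hand off the largest unexpanded subtree, or None if sharing is not allowed.

    pending[d] lists the unexpanded sibling nodes at depth d of the donor's
    current search path (depth 0 is nearest the root).
    """
    if not has_backtracked:              # rule that prevents the race condition
        return None
    for depth, siblings in enumerate(pending):
        if siblings:
            # Removing the node marks it as given away: the donor never expands it.
            return depth, siblings.pop(0)
    return None


pending = [[], ["B"], ["D", "E"]]        # hypothetical state of a search path
print(donate_subtree(pending, has_backtracked=True))   # (1, 'B')
print(pending)                                         # [[], [], ['D', 'E']]
```

Scanning from the root downward is what makes the donated subtree the largest one available, which in turn keeps the number of sharing requests low.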


Performance 

A  serial  version  and  the  three  parallel  algorithms  for 
the  SCP  were  executed  on  a  32-node  Intel  iPSC/2  com¬ 
puter.  Since  the  clock  on  the  host  measures  relative 
process  time  and  the  node  processor’s  clock  measures 
absolute time, one would expect the serial program
execution time to be equivalent on both the node
processor and the host. In most cases, the host search time
was longer, probably due to the context switching of
other  user  processes.  Therefore,  in  order  to  accurately 
measure  the  speedup,  an  optimized  serial  version  of  the 
SCP  is  executed  on  one  of  the  node  processors  and  all 
time  associated  with  communications  is  removed  from 
the  search  time.  A  comparison  between  the  serial  pro¬ 
gram  time  executing  on  a  node  processor  and  the  par¬ 
allel  program  time  decreases  the  measured  speedup, 
but  is  a  more  accurate  reflection  of  the  actual  speedup 
since  the  effects  of  the  hardware  and  operating  system 
are  essentially  eliminated. 

Test  Problems 

Twenty-four 0-1 test matrices are generated to
validate the effectiveness of the SCP algorithms. Five of
these matrices require in excess of two hours to solve
by the serial algorithm and are used to measure the
efficiency of the three parallel SCP algorithms. The
density of the matrices is defined as the total number
of 1's divided by the total number of matrix elements,
and the columns are all assigned unit cost since unit
cost problems are usually more difficult to solve. Three
of the five test cases (matrices #1, #2, and #3) are
100 x 100 matrices with densities of 0.28, 0.27, and 0.26,
respectively. Matrix #4 is 75 x 125 with a density of
0.25 and matrix #5 is 70 x 70 with a density of 0.08.

Performance  Metrics 

For the purpose of this paper, the performance
metrics of primary concern are the search time and the
total execution time. The search time is used to compute
the  speedup  and  the  total  execution  time  is  used  to 
measure  the  maximum  processor  idle-time.  Clearly, 
other  metrics  are  of  interest  such  as  the  number  of 
expanded  nodes,  the  time  spent  sorting  and  prepro¬ 
cessing  the  input  data,  the  time  expended  by  the  dy¬ 
namic  load  balancing  programs,  number  of  times  the 
global  best  cost  was  updated,  and  the  time  when  the 
global  best  cover  was  last  updated.  These  metrics  are 
not  considered  here  but  a  discussion  may  be  found  in 
Beard  [4]. 

Results 

Much  of  the  justification  for  implementing  three  dif¬ 
ferent  parallel  versions  of  the  SCP  is  based  on  reducing 
individual  processor  idle-time.  The  results  indicate 
that,  in  general,  the  coarse  grain  algorithm  incurred 
the  most  idle-time  (85-1065  seconds)  followed  by  the 
fine  grain  algorithm  (152-144  seconds)  and  then  the 
dynamic  load  balanced  (DLB)  algorithm  (19-42  sec¬ 
onds);  however,  these  numbers  are  deceiving.  If  one 
were  to  judge  the  parallel  algorithms  based  solely  on 
processor  idle-time,  the  DLB  algorithm  is  clearly  the 
most  efficient  algorithm  and  the  fine  grain  algorithm  is 
usually  better  than  the  coarse  grain  algorithm.  Such 
a  conclusion  is  invalid. 

Figures  2,  3,  4,  5,  and  6  show  the  normalized 
speedups  obtained  for  each  of  the  five  test  matrices 
described  above  for  1-31  searching  processors.  The 
legend is displayed to the right of each graph and is
defined as follows: L = linear speedup, cg = coarse
grain algorithm with static allocation, fg = fine grain
algorithm with dynamic allocation, and lb = dynamic
load balanced algorithm.
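
For reference, the conventional definitions of speedup and efficiency are written out below. The text does not spell out how its speedups are normalized, so these formulas are an assumption about the usual convention, with $T_{\mathrm{serial}}$ the serial search time on one node and $T_p$ the parallel search time on $p$ searching processors.

```latex
S(p) = \frac{T_{\mathrm{serial}}}{T_{p}}, \qquad
E(p) = \frac{S(p)}{p}, \qquad
\text{super-linear speedup} \iff S(p) > p \iff E(p) > 1 .
```

Under this convention, any measured speedup above 31 on the 31 searching processors used here is super-linear.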

Interpretation  of  the  data  from  three  of  the  five  test 
cases (Figures 2, 3, and 5) shows that even though
the  dynamic  load  balancing  is  effective  in  balancing 
the  load,  the  additional  processing  necessary  to  ac¬ 
complish  this  may  actually  increase  the  overall  search 
time  such  that  the  dynamic  load  balancing  algorithm 
is  slower  than  the  fine  grain  algorithm.  Furthermore, 
three  of  the  test  cases  (Figures  4-6)  indicate  that  the 
coarse  grain  expansion  algorithm  is  more  efficient  than 
the  fine  grain  expansion  algorithm.  Notice  also  that 
four  of  the  five  test  cases  (Figures  3-6)  exhibited  super- 
linear  speedup  with  the  search  of  matrix  #4  shown  in 
Figure  5  obtaining  the  largest  speedup  at  61! 

These seemingly erratic results are easily explained.
Recall  that  the  coarse  grain  and  the  fine  grain  ver¬ 
sions  of  the  SCP  use  different  breadth-first  expansion 
algorithms.  Given  that  NP-complete  problems  are  in¬ 
homogeneous,  the  different  expansion  algorithms  pro¬ 
duce  radically  different  search  graphs;  hence,  the  dif¬ 
ference  in  performance  between  the  coarse  grain  and 
fine  grain  algorithms  is  unpredictable.  In  fact,  one 
could  argue  that  the  two  algorithms  are  searching  en¬ 
tirely  differently  since  the  search  graphs  produced  for 
the  same  problem  are  different. 

The  expansion  algorithm  affects  both  the  solution 
time  and  the  idle-time;  however,  the  processor  idle¬ 
time  is  affected  more  by  the  allocation  of  the  initial 
subgraphs  than  by  the  expansion  algorithm.  Since 
subgraphs  in  the  coarse  grain  algorithm  are  statically 
allocated,  processor  idle-times  are  typically  longer 
with  the  coarse  grain  algorithm  than  with  the  dynam¬ 
ically  allocated  fine  grain  algorithm.  Even  so,  an  ex¬ 
amination  of  the  raw  data  reveals  that  the  coarse  grain 
expansion  algorithm  usually  results  in  a  quicker  best 
cover  time  (i.e.,  it  finds  the  optimal  cover  before  the 
fine  grain  expansion  algorithm). 

The  DLB  algorithm  was  developed  to  further  de¬ 
crease  maximum  processor  idle-time  and  to  improve 
the  efficiency  of  the  search  algorithm.  However,  Fig¬ 
ure  2  shows  that  the  fine  grain  algorithm  is  consistently 
faster  than  the  DLB  algorithm.  Either  the  DLB  is  in¬ 
efficient  or  the  fine  grain  algorithm  is  highly  efficient 
for  this  problem  instance.  In  this  particular  problem, 
the  fine  grain  expansion  algorithm  balances  the  load 
from  the  beginning  of  the  search.  Any  additional  load 
balancing  (e.g.,  dynamic  load  balancing)  simply  steals 
CPU  cycles  from  the  search  algorithm  and  delays  the 
completion  of  the  search. 

The  DLB  algorithm  is  not  necessarily  inefficient; 
however,  it  does  include  additional  code  to  dynam¬ 
ically  share  portions  of  a  processor’s  search  graph. 
Even  though  the  timing  data  obtained  from  the  node 
processors  indicates  an  extremely  small  percentage  of 
time  devoted  to  the  dynamic  load  balancing  process, 
the  data  does  not  show  the  total  processor  time  de¬ 
voted  to  the  token  process.  This  time  is  significant  in 
some  problem  instances  as  shown  in  Figures  2,  3,  4, 
and  5.  In  each  of  these  graphs,  the  speedup  of  the 
fine  grain  algorithm  closely  parallels  the  speedup  of 
the  DLB  algorithm.  Furthermore,  notice  that  the 
fine  grain  algorithm  is  frequently  more  efficient  than 
the  DLB  algorithm  even  though  the  performance  data 
from the searching processors indicates the DLB
algorithm did in fact share subgraphs between searching
processors. Two reasons for the DLB's apparent
inefficiency are: 1) the token process is stealing too much
time from the search process, and 2) the searching
processors are spending too much time partitioning and
sending subgraphs to other processors. Unfortunately, the
mclock() function does not provide a method to
compute the CPU time consumed by the separate token
process; hence, another method must be found to
measure process time. The second reason suggests that the
processors  are  partitioning  the  subgraph  at  too  low  a 
level  in  the  search  tree  and  a  heuristic  algorithm  is 
required  to  prevent  such  low-level  partitioning. 

Despite the previous figures, Figure 6 shows that the
DLB  algorithm  does  work.  For  this  specific  problem, 
the  fine  grain  expansion  algorithm  only  creates  19  sub¬ 
graphs  due  to  limitations  in  the  expansion  algorithm. 
In  effect,  this  is  a  coarse  grain  partitioning  of  the  initial 
search  graph.  Since  only  19  subgraphs  are  developed, 
the  processors  quickly  become  idle  and  the  efficiency 
of  the  search  suffers.  With  the  DLB  algorithm,  the 
idle  processors  immediately  receive  a  subgraph  from 
the  working  processors  and  contribute  to  the  search. 
Had  the  dynamic  load  balancing  algorithm  not  been 
effective,  the  DLB’s  speedup  curve  would  have  paral¬ 
leled  the  fine  grain  algorithm’s  curve  as  in  previous 
graphs. 

Summary  of  Results 

The  SCP  has  application  in  solving  many  ‘real- 
world’  and  NP-complete  problems.  For  example,  air¬ 
line  and  assembly  line  scheduling,  design  of  computer 
systems,  railroad-crew  scheduling,  and  political  dis¬ 
tricting  are  all  types  of  problems  which  can  be  formu¬ 
lated  as  an  SCP  [9,  24,  3].  Furthermore,  since  the  SCP 
is  an  NP-complete  problem,  it  can  be  used  to  solve 
other  NP-complete  problems  such  as  the  assignment 
and  graph  coloring  problems  after  proper  transforma¬ 
tion.  The  key  to  applying  the  SCP  to  any  of  these 
problems  is  to  identify  the  items  that  must  be  covered 
by  some  subset  of  another  list  of  items.  Once  the  two 
lists  of  items  are  identified,  they  must  be  formulated  as 
a  0-1  matrix  with  the  items  to  be  covered  as  the  rows 
and  the  covering  items  as  the  columns.  Additionally, 
the  covering  items  must  have  some  associated  cost  to 
identify  their  relative  importance. 
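
The 0-1 matrix formulation can be made concrete with a tiny exhaustive solver. This is a sketch for illustration only: the matrix, the unit costs, and the function names are hypothetical, and the brute-force enumeration stands in for the branch-and-bound search discussed in this paper.

```python
from itertools import combinations


def is_cover(matrix, chosen):
    """True if every row (item to be covered) has a 1 in at least one chosen column."""
    return all(any(row[j] for j in chosen) for row in matrix)


def min_cost_cover(matrix, costs):
    """Try every subset of columns; return (cost, columns) of the cheapest cover."""
    best = None
    for size in range(1, len(costs) + 1):
        for chosen in combinations(range(len(costs)), size):
            if is_cover(matrix, chosen):
                cost = sum(costs[j] for j in chosen)
                if best is None or cost < best[0]:
                    best = (cost, chosen)
    return best


# Rows are the items to be covered, columns are the covering items.
matrix = [[1, 0, 0, 1],
          [1, 1, 0, 0],
          [0, 1, 1, 0],
          [0, 0, 1, 1]]
costs = [1, 1, 1, 1]                     # unit costs, as in the test problems
print(min_cost_cover(matrix, costs))     # (2, (0, 2))
```

The enumeration visits up to 2^n column subsets, which is why an exact solver needs a pruned search once the matrices reach the sizes used in the test problems.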

One  of  the  objectives  of  this  research  was  to  inves¬ 
tigate  methods  to  parallelize  NP-complete  problems. 
Three  methods  are  presented  and  a  speedup  is  ob¬ 
tained  for  each.  In  fact,  a  super-linear  speedup  is 
obtained  for  four  of  the  five  test  matrices.  The  pos¬ 
sibility  of  super-linear  speedup  in  branch-and-bound 
search  problems  was  predicted  by  Lai  and  Sahni  [17] 
but  it  is  unclear  whether  anyone  had  confirmed  this 
phenomenon  via  the  test  results  from  an  actual  im¬ 
plementation.  This  is  not  to  say  that  the  algorithms 


developed  for  this  research  routinely  produce  a  super- 
linear  speedup.  On  the  contrary,  one  could  develop 
many  test  cases  which  would  quickly  disprove  such  a 
statement.  However,  the  algorithms  presented  here 
show  a  tendency  to  go  super-linear  for  input  test  cases 
that  require  a  substantial  amount  of  time  to  solve  with 
a  serial  algorithm.  More  research  is  required  to  ascer¬ 
tain  whether  specific  problem  characteristics  can  be 
a  priori  exploited  to  obtain  predictable  super-linear 
speedup. 

The  performance  increases  presented  here  are  the 
result  of  a  different  approach  than  that  documented 
in  much  of  the  published  literature  [18,  22,  1,  19,  11]. 
The  typical  approach  to  parallelizing  an  NP-complete 
problem  seems  to  center  around  the  existence  of  a  cen¬ 
trally  maintained  priority  queue  containing  unsolved 
subbranches.  The  processors  receive  a  subgraph,  fur¬ 
ther  partition  the  subgraph,  and  then  transmit  the 
newly  partitioned  subgraphs  back  to  the  centrally 
maintained  queue.  Such  an  approach  is  communica¬ 
tions  intensive  as  shown  by  Quinn  [22].  The  approach 
presented  here  is  to  partition  the  search  space  first  and 
distribute  the  subgraphs  to  the  individual  processors. 
As  such,  the  communications  overhead  becomes  in¬ 
significant  and  the  problem  becomes  compute  bound. 
This  simple  but  elegant  approach  to  the  initial  load 
balancing  is  only  possible  because  of  the  preordering 
(i.e.,  the  construction  of  the  SCP  table)  accomplished 
before  the  search.  The  result  is  a  simple  and  highly 
efficient  initial  distribution  of  the  load  for  many  prob¬ 
lem  instances.  The  possibility  of  a  similar  preordering 
in  other  NP-complete  problems  is  left  for  future  re¬ 
searchers. 

To  date,  much  of  the  research  into  parallel  branch- 
and-bound  algorithms  has  focused  on  the  traveling 
salesman  problem.  The  research  presented  here  con¬ 
tains  the  first  known  parallel  implementation  of  the 
SCP.  Given  the  general  application  of  the  SCP  to 
many  different  problems  and  the  results  published  in 
this  document,  applications  based  on  a  parallel  SCP 
(e.g.,  weapon  to  target  assignment,  optimal  resource 
scheduling,  VLSI  expression  simplification,  and  infor¬ 
mation  retrieval)  could  achieve  considerable  perfor¬ 
mance  increases.  Furthermore,  the  methods  presented 
here  show  that  it  is  possible  to  realize  a  performance 
increase  using  control  and  data  structures  centered 
around  something  other  than  a  centrally  maintained 
priority  queue. 

The  results  further  indicate  that  the  performance 
of  a  parallel  NP-complete  search  is  highly  dependent 
on  the  method  chosen  to  distribute  or  balance  the  load 
between  the  processors.  The  initial  distribution  of  sub¬ 
graphs  accomplished  by  the  parallel  SCP  algorithms, 
in  many  of  the  test  cases,  is  sufficient  to  ensure  a  ‘good’ 


load  balancing.  However,  as  Figure  6  indicates,  a  dy¬ 
namic  load  balancing  algorithm  is  necessary  in  those 
instances  where  the  initial  distribution  fails  to  obtain 
the  desired  load  balance.  The  dynamic  load  balancing 
algorithm developed for this research is a much simpler
algorithm than those presented by Felten [11] or
Ma  [18].  The  algorithm  employs  a  separate  process  to 
pass  a  token  between  the  processors  and  to  coordinate 
all  load  balancing.  The  separate  token  process  is  de¬ 
signed  such  that  termination  is  easily  detected  and,  in 
the  absence  of  any  other  load  balancing  scheme,  the 
DLB  algorithm  may  provide  acceptable  performance. 

The concepts used to develop the data and control
structures for this design lend themselves to
solving general NP-complete problems effectively.
These concepts include the
division of data and control for the search algorithms,
as  well  as  the  load  balancing  algorithms  required  to 
achieve  the  most  productivity  from  every  node  proces¬ 
sor.  For  the  a  priori  reductions,  the  data  is  partitioned 
out  to  the  processors  where  a  reduction  is  performed 
on  the  reduced  problem.  The  results  of  the  individual 
reductions  are  recombined  in  neighboring  processors 
and  further  reduced.  Each  node  search  is  essentially 
a  serial  algorithm  searching  a  reduced  section  of  the 
tree  with  knowledge  of  the  lowest  cost  obtained  by  all 
processors.  Finally,  the  inhomogeneous  nature  of  NP- 
complete  problems  forces  the  development  of  a  load 
balancing  algorithm.  Many  such  algorithms  are  possi¬ 
ble;  however,  the  designer  must  balance  the  amount  of 
time  required  to  load  balance  against  the  time  required 
to  complete  the  search.  The  dynamic  load  balancing 
algorithm  developed  for  this  SCP  research  is  simple, 
reusable,  and  effective. 

tatives: Tony Anderson, Ray Asbury, Sean Griffin, and

1 Introduction

The so-called Assignment Problem is of considerable
importance in a variety of applications, and can be
stated as follows. Let

$$A = \{a_1, a_2, \ldots, a_{N_A}\} \tag{1}$$

and

$$B = \{b_1, b_2, \ldots, b_{N_B}\} \tag{2}$$

be two sets of items and let

$$d_{ij} = d[a_i, b_j] \ge 0, \qquad a_i \in A,\ b_j \in B \tag{3}$$

be a measure of the distance (dissimilarity) between
individual items from the two lists. Taking $N_A \le N_B$,
the objective of the assignment problem is to find the
particular mapping

$$i \mapsto \Pi(i), \qquad 1 \le i \le N_A, \quad 1 \le \Pi(i) \le N_B \tag{4}$$

$$i \ne j \Rightarrow \Pi(i) \ne \Pi(j) \tag{5}$$

such that the total association score

$$S_{tot} = \sum_{i=1}^{N_A} d_{i,\Pi(i)} \tag{6}$$

is minimized over all permutations $\Pi$.
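
The problem statement can be turned directly into an exhaustive search, which is useful as a reference for small cases. The sketch below is illustrative only; the function name and the example distances are hypothetical. It enumerates every injective map from list A into list B, of which there are $N_B!/(N_B - N_A)!$.

```python
from itertools import permutations
from math import perm


def brute_force_assignment(d):
    """d[i][j] is the distance between a_i and b_j, with N_A = len(d) <= N_B = len(d[0]).

    Returns (S_tot, mapping) where mapping[i] is the index of the b item paired with a_i.
    """
    n_a, n_b = len(d), len(d[0])
    best_score, best_map = None, None
    for mapping in permutations(range(n_b), n_a):      # all injective maps A -> B
        score = sum(d[i][mapping[i]] for i in range(n_a))
        if best_score is None or score < best_score:
            best_score, best_map = score, mapping
    return best_score, best_map


d = [[4, 1, 3],
     [2, 0, 5]]                          # hypothetical distances, N_A = 2, N_B = 3
print(brute_force_assignment(d))         # (3, (1, 0))
print(perm(3, 2))                        # 6 candidate mappings were examined
```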

For $N_A \le N_B$, the naive (exhaustive search)
complexity of the assignment problem is $O[N_B!/(N_B - N_A)!]$.
There are, however, a variety of exact solutions
to the assignment problem with reduced complexity
$O[N_A^2 N_B]$ (Refs. [1-3]). Section 2 briefly describes one
such method, Munkres algorithm [2], and presents a
particular sequential implementation. Performance of
the algorithm is examined for the particularly nasty
problem of associating lists of random points within
the unit square. In Section 3, the algorithm is
generalized for concurrent execution, and performance
results for runs on the Mark III hypercube are
presented.


2  The  Sequential  Algorithm 

The input to the assignment problem is the matrix
$D = \{d_{ij}\}$ of dissimilarities from Eq.(3). The first
point to note is that the particular assignment which
minimizes Eq.(6) is not altered if a fixed value is added
to or subtracted from all entries in any row or column
of the cost matrix D. Exploiting this fact, Munkres
solution to the Assignment Problem can be divided
into two parts:

M1 : Modifications of the distance matrix D by
row/column subtractions, creating a (large)
number of zero entries.

M2 : With $\{R_Z(i)\}$ denoting the row indices of all
zeros in column i, construction of a so-called
Minimal Representative Set, meaning a distinct
selection $R_Z(i)$ for each i, such that
$i \ne j \Rightarrow R_Z(i) \ne R_Z(j)$.

The  steps  of  Munkres  algorithm  generally  follow  those 
in  the  constructive  proof  of  P.  Hall’s  theorem  on  Min¬ 
imal  Representative  Sets. 
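
The invariance that part M1 relies on is easy to check numerically. The sketch below sticks to the square case, where every row and every column is used exactly once by any assignment, so a shift of one row or column changes all candidate scores by the same amount. The example matrix and the function name are hypothetical.

```python
from itertools import permutations


def best_assignment(d):
    """Exhaustively find the permutation p minimizing sum of d[i][p[i]] (square d only)."""
    n = len(d)
    return min(permutations(range(n)),
               key=lambda p: sum(d[i][p[i]] for i in range(n)))


d = [[4, 1, 3],
     [2, 0, 5],
     [3, 2, 2]]

shifted = [row[:] for row in d]
for k in range(3):
    shifted[0][k] -= 2                   # subtract a fixed value from row 0
for j in range(3):
    shifted[j][1] += 5                   # add a fixed value to column 1

print(best_assignment(d), best_assignment(shifted))   # (1, 0, 2) (1, 0, 2)
```

Every permutation's score moves by the same +3 here, so the minimizing assignment is unchanged even though its score is not.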

The  preceding  paragraph  provides  a  hopelessly  in¬ 
complete  hint  as  to  the  number  theoretic  basis  for 
Munkres  Algorithm.  The  particular  implementation 
of  Munkres  algorithm  used  in  this  work  is  as  de¬ 
scribed  in  Chapter  14  of  Ref.[3].  To  be  definite,  take 
$N_A \le N_B$, and let the columns of the distance matrix
be associated with items from list A. The first step is
to subtract the smallest item in each column from all
entries in the column. The rest of the algorithm can
be viewed as a search for special zero entries (starred
zeros Z*), and proceeds as follows:

Munkres  Algorithm 

Step  1  :  Setup 

1. Find a zero Z in the distance matrix.

2. If there is no starred zero already in its row
or column, star this zero.

3. Repeat steps 1.1, 1.2 until all zeros have
been considered.

Step 2 : Z* Count, Solution Assessment.

1. Cover every column containing a Z*.

2. Terminate the algorithm if all columns are
covered. In this case, the locations of the Z*
entries in the matrix provide the solution to
the assignment problem.

Step 3 : Main Zero Search

1. Find an uncovered Z in the distance matrix
and prime it, Z → Z'. If no such zero exists,
go to Step 5.

2. If no Z* exists in the row of the Z', go to
Step 4.

3. If a Z* exists, cover this row and uncover
the column of the Z*. Return to Step 3.1
to find a new Z.

Step 4 : Increment Set Of Starred Zeros

1. Construct the ‘Alternating Sequence’ of
primed and starred zeros:

Z_0 : Unpaired Z' from Step 3.2.

Z_1 : The Z* in the column of Z_0.

Z_{2N} : The Z' in the row of Z_{2N-1}, if such
a zero exists.

Z_{2N+1} : The Z* in the column of Z_{2N}.

The sequence eventually terminates with an
unpaired Z' = Z_{2N} for some N.

2. Unstar each starred zero of the sequence.

3. Star each primed zero of the sequence, thus
increasing the number of starred zeros by
one.


4. Erase all primes, uncover all columns and
rows, and return to Step 2.

Step 5 : New Zero Manufactures

1. Let h be the smallest uncovered entry in the
(modified) distance matrix.

2. Add h to all covered rows.

3. Subtract h from all uncovered columns.

4. Return to Step 3, without altering stars,
primes or covers.

Figure 1: Flowchart for Munkres algorithm. [Boxes: Step 1, Initialization; Step 2, Z* Count, End?; Step 3, Zero Search (outcomes: Boring Z, Interesting Z, No Z); Step 4, Z' -> Z* Swaps; Step 5, Zero Manufacture.]

A (very) schematic flowchart for the algorithm is
shown in Fig.(1). Note that Steps 1 and 5 of the
algorithm overwrite the original distance matrix.

The preceding algorithm involves flags (starred or
primed) associated with zero entries in the distance
matrix, as well as ‘Covered’ tags associated with
individual rows and columns. The implementation of
the zero tagging is done by first noting that there is
at most one Z* or Z' in any row or column. The
covers and zero tags of the algorithm are accordingly
implemented using five simple arrays:

CC(k) : Covered column tags, $1 \le k \le N_{cols}$.

CR(j) : Covered row tags, $1 \le j \le N_{rows}$.

ZS(k) : Z* locators for columns of the matrix. If
positive, ZS(k) is the row index of the Z* in the
k-th column of the matrix.

ZR(j) : Z* locators for rows of the matrix. If
positive, ZR(j) is the column of the Z* in the j-th
row of the matrix.

ZP(j) : Z' locators for rows of the matrix. If
positive, ZP(j) is the column of the Z' in the j-th row
of the matrix.

Entries in the cover arrays CC and CR are one if the
row or column is covered, zero otherwise. Entries in
the zero-locator arrays ZS, ZR and ZP are zero if no
zero of the appropriate type exists in the indexed row
or column.
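
Steps 1 through 5 and the five arrays can be put together in a compact sequential sketch. It is not the implementation timed in this paper. It assumes 0-based indices with -1 (rather than zero) meaning "no such zero", it reads Step 3.3 as uncovering the column that holds the row's Z* (the standard form of that step), and it uses a plain rescan for the zero search instead of the pointer scheme described in the text.

```python
def munkres(dist):
    """Return ZS, where ZS[k] is the row assigned to column k (0-based)."""
    n_rows, n_cols = len(dist), len(dist[0])
    assert n_cols <= n_rows, "columns carry the shorter list A"
    d = [row[:] for row in dist]                 # work on a copy
    for k in range(n_cols):                      # preliminaries: subtract column minima
        m = min(d[j][k] for j in range(n_rows))
        for j in range(n_rows):
            d[j][k] -= m
    CC, CR = [0] * n_cols, [0] * n_rows          # cover tags
    ZS = [-1] * n_cols                           # row of the Z* in column k
    ZR = [-1] * n_rows                           # column of the Z* in row j
    ZP = [-1] * n_rows                           # column of the Z' in row j
    for j in range(n_rows):                      # Step 1: initial stars
        for k in range(n_cols):
            if d[j][k] == 0 and ZR[j] < 0 and ZS[k] < 0:
                ZR[j], ZS[k] = k, j
    while True:
        CC = [1 if ZS[k] >= 0 else 0 for k in range(n_cols)]   # Step 2
        if all(CC):
            return ZS
        while True:                              # Step 3: main zero search
            z = next(((j, k) for j in range(n_rows) if not CR[j]
                      for k in range(n_cols) if not CC[k] and d[j][k] == 0), None)
            if z is None:                        # Step 5: manufacture a new zero
                h = min(d[j][k] for j in range(n_rows) if not CR[j]
                        for k in range(n_cols) if not CC[k])
                for j in range(n_rows):
                    for k in range(n_cols):
                        if CR[j]:
                            d[j][k] += h
                        if not CC[k]:
                            d[j][k] -= h
                continue
            j, k = z
            ZP[j] = k                            # Step 3.1: prime the zero
            if ZR[j] < 0:
                break                            # Step 3.2: no Z* in this row
            CR[j] = 1                            # Step 3.3: cover the row,
            CC[ZR[j]] = 0                        # uncover the column of its Z*
        seq = [(j, k)]                           # Step 4: alternating sequence
        while ZS[seq[-1][1]] >= 0:
            col = seq[-1][1]
            row = ZS[col]
            seq.append((row, col))               # Z* in the column of the last Z'
            seq.append((row, ZP[row]))           # Z' in the row of that Z*
        for row, col in seq[1::2]:               # unstar the starred zeros
            ZR[row], ZS[col] = -1, -1
        for row, col in seq[0::2]:               # star the primed zeros
            ZR[row], ZS[col] = col, row
        CR, ZP = [0] * n_rows, [-1] * n_rows     # Step 4.4: erase primes and covers


print(munkres([[4, 1, 3], [2, 0, 5], [3, 2, 2]]))   # [1, 0, 2]
```

Rows correspond to items of list B and columns to items of list A, so the returned list gives, for each A item, the B item it is paired with.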

Figure 2: Timing results for the sequential algorithm
versus problem size. [Plot title: Sequential Timings; axis label: $N_A = N_B$.]

With the Star-Prime-Cover scheme of the preceding
paragraph, a sequential implementation of
Munkres algorithm is completely straightforward. At
the beginning of Step 1, all cover and locator flags
are set to zero, and the initial zero search provides an
initial set of non-zero entries in ZS(). Step 2 sets
appropriate entries in CC() to one and simply counts the
covered columns. Steps 3 and 5 are trivially
implemented in terms of the Cover/Zero arrays and the
‘Alternating Sequence’ for Step 4 is readily constructed
from the contents of ZS(), ZR() and ZP().

As an initial exploration of Munkres algorithm,
consider the task of associating two lists of random
points within a 2D unit square, taking the cost
function in Eq.(3) to be the usual Cartesian distance.
Fig.(2) plots total CPU times for execution of Munkres
algorithm for equal size lists versus list size. The
vertical axis gives CPU times in seconds for one node
of the Mark III hypercube. The circles and crosses
show the time spent in Steps 5 and 3, respectively.
These two steps (zero search and zero manufacture)
account for essentially all of the CPU time. For the
190 x 190 case, the total CPU time spent in Step 2 was
about 0.9 CPU sec, and that spent in Step 4 was too
small to be reliably measured. The large amounts of
time spent in Steps 3 and 5 arise from the very large
numbers of times these parts of the algorithm are
executed. The 190 x 190 case involves 6109 entries into
Step 3 and 593 entries into Step 5.

Since the zero searching in Step 3 of the algorithm
is required so often, the implementation of this step
is done with some care. The search for zeros is done
column-by-column, and the code maintains pointers
to both the last column searched and the most
recently uncovered column (Step 3.3) in order to reduce
the time spent on subsequent re-entries to the Step 3
box of Fig.(1).

The dashed line in Fig.(2) indicates the nominal
$\Delta T \propto N^3$ scaling predicted for Munkres algorithm.
By and large, the timing results in Fig.(2) are
consistent with this expected behavior. It should be noted,
however, that both the nature of this scaling and the
coefficient of $N^3$ are very dependent on the nature of
the data sets. Consider, for example, two identical
trivial lists

$$a_i = b_i = i, \qquad 1 \le i \le N \tag{7}$$

with the distance between items given by the absolute
value function. For the datasets in Eq.(7), the
preliminaries and Step 1 of Munkres algorithm completely
solve the association in a time which scales as $N^2$.
In contrast, the random point association problem is
a much greater challenge for the algorithm, as
nominal pairings indicated by the initial nearest-neighbor
searches of the preliminary step are tediously undone
in the creation of the staircase-like sequence of zeros
needed for Step 4. As a brief, instructive illustration
of the nature of this processing, Fig.(3) plots the CPU
time per step for the last passes through the outer
loop of Fig.(1) for the 150 x 150 assignment problem
(recall that each pass through the outer loop increases
the Z* count by one). The processing load per step
is seen to be highly non-uniform.
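
The easy case of Eq.(7) can be checked with a few lines. The sketch below performs only the column-minimum subtraction and the Step 1 starring pass; the function name and the list size N = 6 are arbitrary choices for illustration.

```python
def preliminaries_and_step1(dist):
    """Subtract column minima, then star zeros greedily; return the starred row per column."""
    n_rows, n_cols = len(dist), len(dist[0])
    col_min = [min(dist[j][k] for j in range(n_rows)) for k in range(n_cols)]
    star_row = [-1] * n_cols             # row of the starred zero in each column
    row_used = [False] * n_rows
    for j in range(n_rows):              # both passes touch each matrix entry once
        for k in range(n_cols):
            if dist[j][k] - col_min[k] == 0 and not row_used[j] and star_row[k] < 0:
                star_row[k], row_used[j] = j, True
    return star_row


N = 6
dist = [[abs(a - b) for a in range(1, N + 1)] for b in range(1, N + 1)]
stars = preliminaries_and_step1(dist)
print(stars == list(range(N)))           # True: every column is starred on the diagonal
```

Because every column already holds a starred zero, Step 2 would terminate immediately, and the only work done is the two sweeps over the N x N matrix.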

3  The  Concurrent  Algorithm 

The  timing  results  from  Fig. (2)  clearly  dictate  the 
manner in which the calculations in Munkres
algorithm should be distributed among the nodes of a
hypercube for concurrent execution. The zero and
minimum element searches for Steps 3 and 5 are the most
time consuming and should be done concurrently. In
contrast, the essentially bookkeeping tasks associated
with Steps 2 and 4 require insignificant CPU time and
are most naturally done in lockstep (i.e., all nodes of
the  hypercube  perform  the  same  calculations  on  the 
same  data  at  the  same  time).  The  details  of  the  con¬ 
current  algorithm  are  as  follows. 

Figure 3: Times per loop (i.e., N[Z*] increment) for
the last several loops in the solution of the 150 x 150
problem. [Axis label: step.]

Data  Decomposition 

The distance matrix $\{d_{ij}\}$ is distributed across the
nodes of the hypercube, with entire columns assigned
to individual nodes. (This assumes, effectively, that
$N_{COLS} \ge N_{NODES}$, which is always the case for
assignment problems which are big enough to be
‘interesting’.) The cover and zero locator lists defined in
Section  2  are  duplicated  on  all  nodes. 

Task  Decomposition 

The  concurrent  implementation  of  Step  5  is  partic¬ 
ularly  trivial.  Each  node  first  finds  its  own  minimum 
uncovered  value,  setting  this  value  to  some  ‘infinite’ 
token  if  all  columns  assigned  to  the  node  are  covered. 
A  simple  loop  on  communication  channels  determines 
the  global  minimum  among  the  node-by-node  mini¬ 
mum  values,  and  each  node  then  modifies  the  contents 
of  its  local  portion  of  the  distance  matrix  according 
to  Steps(5.2,5.3). 
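
The column decomposition and the concurrent form of Step 5 can be mimicked serially. In the sketch below the nodes are simulated by a loop, the round-robin column assignment is an assumption (the text only says that entire columns go to individual nodes), and a plain `min` over the node results stands in for the loop on communication channels.

```python
INF = float("inf")                       # the 'infinite' token for a fully covered node


def columns_of(node, n_nodes, n_cols):
    """Columns owned by a node, assuming a round-robin distribution."""
    return range(node, n_cols, n_nodes)


def local_min(d, CR, CC, cols):
    """Smallest uncovered entry among a node's columns, or INF if none is uncovered."""
    return min((d[j][k] for k in cols if not CC[k]
                for j in range(len(d)) if not CR[j]), default=INF)


def concurrent_step5(d, CR, CC, n_nodes):
    n_cols = len(d[0])
    mins = [local_min(d, CR, CC, columns_of(n, n_nodes, n_cols)) for n in range(n_nodes)]
    h = min(mins)                        # stands in for the global minimum exchange
    for n in range(n_nodes):             # each node updates only its own columns
        for k in columns_of(n, n_nodes, n_cols):
            for j in range(len(d)):
                if CR[j]:
                    d[j][k] += h         # Step 5.2: add h to covered rows
                if not CC[k]:
                    d[j][k] -= h         # Step 5.3: subtract h from uncovered columns
    return h, mins


d = [[2, 1, 1], [0, 0, 3], [1, 2, 0]]
h, mins = concurrent_step5(d, CR=[0, 1, 0], CC=[0, 0, 1], n_nodes=3)
print(h, mins)                           # 1 [1, 1, inf]
print(d)                                 # [[1, 0, 1], [0, 0, 4], [0, 1, 0]]
```

Only the scalar minima cross node boundaries; the matrix update itself needs no communication because the cover lists are duplicated on every node.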

The  concurrent  implementation  of  Step  3  is  just 
slightly  more  awkward.  On  entry  to  Step  3,  each  node 
searches  for  zeros  according  to  the  rules  of  Section  2, 
and  fills  a  3-element  status  list: 

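$$L[j] = L[\mathrm{Node}_j] = \{S,\ k_{\rm row},\ k_{\rm col}\} \qquad (8)$$

where S is a zero-search status flag,

$$S = \begin{cases} -1 & \text{No } Z \text{ was found} \\ \phantom{-}0 & Z \text{ with } Z^* \text{ in row (Boring)} \\ \phantom{-}1 & Z \text{ without } Z^* \text{ (Interesting)} \end{cases} \qquad (9)$$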

If  the  status  is  non-negative,  the  last  two  entries  in 
the  status  list  specify  the  location  of  the  found  zero. 
A  simple  channel  loop  is  used  to  collect  the  individual 
status  lists  of  each  node  into  all  nodes,  and  the  action 
taken  next  by  the  program  is  as  follows: 

•  If  all  nodes  give  negative  status  (no  Z  found),  all 
nodes  proceed  to  Step  5. 

•  If  any  node  gives  status  1,  all  nodes  proceed  to 
Step  4  for  lockstep  updates  of  the  zero  location 
lists,  using  the  row-column  indices  of  the  node 
which  gave  status  1  as  the  starting  point  for 
Step  4.1.  If  more  than  one  node  returns  status  1 
(highly  unlikely,  in  practice),  only  the  first  such 
node  (lower  node  number)  is  used. 

•  If  all  zeros  uncovered  are  ‘Boring’,  the  cover¬ 
switching  in  Step  3.3  of  the  algorithm  is  per¬ 
formed.  This  is  done  in  lockstep,  processing  the 
Z’s  returned  by  the  nodes  in  order  of  increas¬ 
ing  node  number.  Note  that  the  cover  rearrange¬ 
ments  performed  for  one  node  may  well  cover  a  Z 
returned  by  a  node  with  higher  node  number.  In 
such  cases,  the  nominal  Z  returned  by  the  later 
node  is  simply  ignored. 
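The three-way decision above can be sketched as a small function that every node evaluates identically on the gathered status lists. The names and the tuple layout `(S, k_row, k_col)` are illustrative assumptions; the cover switching of Step 3.3 itself is not reproduced here.

```python
NO_ZERO, BORING, INTERESTING = -1, 0, 1

def next_action(status_lists):
    """Decide the next step from the status lists of all nodes.

    status_lists[n] is the (S, k_row, k_col) list of node n, in node order.
    """
    if all(s == NO_ZERO for s, _, _ in status_lists):
        return ("step5", None)
    for s, row, col in status_lists:
        if s == INTERESTING:
            # The first (lowest-numbered) node with status 1 wins.
            return ("step4", (row, col))
    # Only 'Boring' zeros were found: hand them to the Step 3.3 cover switching
    # in order of increasing node number. A zero that an earlier switch has
    # already covered is to be skipped by the caller.
    boring = [(row, col) for s, row, col in status_lists if s == BORING]
    return ("step3.3", boring)
```

Because the status lists are identical on all nodes after the channel loop, every node reaches the same decision and the lockstep updates stay consistent.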

It  is  worth  emphasizing  that  only  the  actual  searches 
for  zero  and  minimum  entries  in  Steps  3  and  5  are 
done  concurrently.  The  updates  of  the  cover  and  zero 
locator  lists  are  done  in  unison. 

The concurrent algorithm has been implemented on
the Mark III hypercube, and has been tested against
random point association tasks for a variety of list
sizes. Before examining results of these tests, however,
it is worth noting that the concurrent implementation
is not particularly dependent on the hypercube
topology. The only communication-dependent parts
of the algorithm are

1.  Determination  of  the  ensemble-wide  minimum 
value  for  Step  5. 

2. Collection of the local Step 3 status lists (Eq. (9)),

either  of  which  could  be  easily  done  for  almost  any 
MIMD  architecture. 

Table 1 presents performance results for the association
of random lists of 200 points on the Mark III
hypercube for various cube dimensions. (For consistency,
of course, the same input lists are used for all
runs.) Time values are given in CPU seconds for the
total execution time, as well as the time spent in Steps
3 and 5. Also given are the standard concurrent
execution efficiencies,

$$\epsilon_N = \frac{T[1\ \mathrm{Node}]}{N \times T[N\ \mathrm{Nodes}]} \qquad (10)$$

as well as the numbers of times the Step 3 box of
Fig. (1) is entered during execution of the algorithm.
The numbers of entries into the other boxes of Fig. (1)
are independent of the hypercube dimension.

Table 1: Concurrent performance for 200x200 random points

N[Nodes]            1         2         4         8
T[Total]       654.83    372.70    205.48    119.25
T[Step 3]      183.80    128.04     81.59     56.66
T[Step 5]      462.06    237.54    117.39     57.94
ε[Total]            -     0.878     0.800     0.686
ε[Step 3]           -     0.718     0.563     0.405
ε[Step 5]           -     0.973     0.984     0.997
N[Step 3]        7075      4837      3483      2778

Table 2: Concurrent performance for 100x100 random points

N[Nodes]            1         2         4         8
T[Total]        68.08     38.79     23.11     16.40
T[Step 3]       19.63     13.09      9.69      8.00
T[Step 5]       44.99     22.99     11.79      6.16
ε[Total]            -     0.878     0.736     0.519
ε[Step 3]           -     0.750     0.506     0.307
ε[Step 5]           -     0.978     0.954     0.913
N[Step 3]        2029      1430      1134       991
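As a quick check of the efficiency definition, the sketch below recomputes the total-time efficiencies from the timings reported for the 100x100 problem (68.08, 38.79, 23.11, and 16.40 CPU seconds on 1, 2, 4, and 8 nodes). The function and variable names are illustrative.

```python
def efficiency(t_one_node, t_n_nodes, n_nodes):
    """Concurrent efficiency: T[1 Node] / (N * T[N Nodes])."""
    return t_one_node / (n_nodes * t_n_nodes)

# Total execution times in CPU seconds for the 100x100 problem, keyed by node count.
totals = {1: 68.08, 2: 38.79, 4: 23.11, 8: 16.40}

efficiencies = {n: efficiency(totals[1], t, n) for n, t in totals.items() if n > 1}
```

The computed values agree with the tabulated efficiencies 0.878, 0.736, and 0.519 to three decimal places.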

There is an aspect of the timing results in Table 1
which should be noted. Namely, essentially all
inefficiencies of the concurrent algorithm are associated
with Step 3 (compare, e.g., T[Step 3] for 2 nodes with
T[Step 3] for 1 node). The times spent in Step 5 are
approximately halved for each increase in the dimension
of the hypercube. However, the efficiencies associated
with the zero searching in Step 3 are rather poorer,
particularly for larger numbers of nodes.

At a simple, qualitative level, the inefficiencies
associated with Step 3 are readily understood. Consider
the task of finding a single zero located somewhere
inside an N x N matrix. The mean sequential search
time is

$$\langle T_{\rm Search}[1\ \mathrm{Node}] \rangle \propto N^2/2 \qquad (11)$$

since, on average, half of the entries of the matrix will
be examined before the zero is found. Now consider
the same zero search on two nodes. The node which
has the half of the matrix containing the zero will find
it in about half the time of Eq. (11). However, the
other node will always search through all of its N x
N/2 items before returning a null status for Eq. (9).
Since the node which found the zero must wait for the
other node before the (lockstep) modifications of zero
locators and cover tags, the node without the zero
determines the actual time spent in Step 3, so that

$$\langle T_{\rm Search}[2\ \mathrm{Nodes}] \rangle \approx \langle T_{\rm Search}[1\ \mathrm{Node}] \rangle \qquad (12)$$
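The argument can be checked with a small counting model. The sketch below is only an idealization: it assumes a single zero that is equally likely to sit at any of the m = N x N positions, measures time as the number of entries examined, and splits the entries evenly between two nodes. The function names are hypothetical.

```python
def mean_sequential_cost(m):
    """Mean number of entries examined by one node scanning m entries."""
    return sum(range(1, m + 1)) / m

def mean_two_node_cost(m):
    """Mean duration of the search step on two nodes (m must be even).

    The step lasts until the slower node is done, because the lockstep
    updates cannot start before both nodes have reported.
    """
    half = m // 2
    total = 0
    for position in range(1, m + 1):
        finder = (position - 1) % half + 1   # entries scanned by the node owning the zero
        other = half                         # the other node scans its whole half
        total += max(finder, other)
    return total / m
```

For a 10 x 10 matrix (m = 100) the model gives 50.5 entries on one node and 50.0 on two, so the second node buys essentially nothing for a single isolated zero.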

In the full program, the concurrent bottleneck is
not as bad as Eq. (12) would imply. As noted above,
the concurrent algorithm can process multiple ‘Boring’
Z’s in a single pass through Step 3. The frequency
of such multiple Z’s per step can be estimated by
noting the decreasing number of times Step 3 is entered
with increasing hypercube dimension, as indicated in
Table 1. Moreover, each node maintains a counter
of the last column searched during Step 3. On
subsequent re-entries, columns prior to this marked column
are searched for zeros only if they have had their cover
tag changed during the prior Step 3 processing. While
each of these algorithm elements does diminish the
problems associated with Eq. (12), the fact remains
that the search for zero entries in the distributed
distance matrix is the least efficient step in concurrent
implementations of Munkres algorithm.

The results presented in Table 1 demonstrate that
an efficient implementation of Munkres algorithm is
certainly feasible. It is next interesting to examine
how these efficiencies change as the problem size is
varied.

The results shown in Tables 2 and 3 demonstrate an
improvement of concurrent efficiencies with increasing
problem size, which is the expected result. For the 100x100
problem on 8 nodes, the efficiency is only about
50%. This problem is too small for 8 nodes, with only 12 or 13
columns of the distance matrix assigned to individual
nodes.

Table 3: Concurrent performance for 300x300 random points

N[Nodes]            1         2         4         8
T[Total]      2046.91   1154.27    622.53    353.30
T[Step 3]      585.61    399.41    235.49    154.57
T[Step 5]     1442.22    742.90    377.89    188.59
ε[Total]            -     0.887     0.822     0.728
ε[Step 3]           -     0.733     0.621     0.473
ε[Step 5]           -     0.971     0.954     0.956
N[Step 3]       13250      8583      5785      4365

While the performance results in Tables 1-3 are
certainly acceptable, it is nonetheless interesting to
investigate possible improvements of efficiency for the
zero searches in Step 3. The obvious candidate for
an algorithm modification is some sort of checkpointing:
at intermediate times during the zero search,
the nodes exchange a ‘Zero Found Yet?’ status flag,
with all nodes breaking out of the zero search loop if
any node returns a positive result.

For message passing machines such as the Mark III,
the checkpointing scheme is of little value, as the time
spent in an individual entry to Step 3 is not enormous
compared to the node-to-node communication time.
For example, for the 2-node solution of the 300x300
problem, the mean time for a single entry to Step 3
is only about 46 msec, compared to a typical
node-to-node communications time which can be a
significant fraction of a millisecond. As a (not
unexpected) consequence, all attempts to improve the
Step 3 efficiencies through various ‘Found Anything?’
schemes were completely unsuccessful.
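The 46 msec figure follows directly from the 2-node entries for the 300x300 problem (399.41 CPU seconds spent in Step 3 over 8583 entries). The sketch below reproduces it and adds a rough overhead estimate. The exchange time and checkpoint count passed to `checkpoint_overhead` are purely hypothetical inputs, since the text only states that an exchange costs a significant fraction of a millisecond.

```python
step3_seconds = 399.41   # T[Step 3] on 2 nodes, 300x300 problem
step3_entries = 8583     # N[Step 3] on 2 nodes, 300x300 problem

mean_entry_ms = 1000.0 * step3_seconds / step3_entries   # about 46.5 msec

def checkpoint_overhead(n_checkpoints, exchange_ms, entry_ms=mean_entry_ms):
    """Fraction of one Step 3 entry spent exchanging 'found yet?' flags."""
    return n_checkpoints * exchange_ms / entry_ms
```

With an assumed 0.5 msec per exchange and 10 checkpoints per entry, for instance, `checkpoint_overhead(10, 0.5)` is roughly a tenth of the entry time, which illustrates how quickly the flag exchanges can consume whatever search time they save.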

The  checkpointing  difficulties  for  a  message-passing 
machine  could  disappear,  of  course,  on  a  shared  mem¬ 
ory  machine.  If  the  zero-search  status  flags  for  the 
various  nodes  could  be  kept  in  memory  locations  read¬ 
ily  (i.e.,  rapidly)  accessible  to  all  nodes,  the  problems 
of  the  preceding  paragraph  might  be  eliminated.  It 
would  be  interesting  to  determine  whether  significant 
improvements  on  the  (already  good)  efficiencies  of  the 
concurrent  Munkres  algorithm  could  be  achieved  on 
a  shared  memory  machine. 
Abstract

This  research  is  concerned  with  approximation  algo¬ 
rithms  for  NP-hard  optimization  problems  on  hyper- 
cube  multiprocessors.  We  investigate  methods  of  solv¬ 
ing  such  problems,  focusing  on  the  tradeoffs  in  running 
time,  number  of  active  nodes,  input  size,  and  accuracy 
of solution. In this paper, we consider a tiered
algorithm framework that describes our level algorithms
and we expand upon the 2-dimensional bin packing
results given in [4]. The major contributions of this
paper  are  data  structures  which  dramatically  improve 
the  run  time  of  the  first  fit  and  best  fit  algorithms  pre¬ 
sented  in  [4].  The  results  in  this  paper  were  obtained 
on  a  32  node  Intel  iPSC/2. 

Introduction 

A  variety  of  important  industrial  optimization  prob¬ 
lems  are  known  to  be  NP-hard,  which  implies  that 
we  should  not  expect  to  find  efficient  (i.e.,  polynomial 
time)  algorithms  yielding  optimal  solutions  to  these 
problems  for  all  input  sets.  (These  problems  include 
packing  items  on  trucks,  scheduling  jobs  on  a  com¬ 
puter  system,  and  a  variety  of  stock-cutting  problems, 
to name a few.) In fact, if P ≠ NP then even a
polynomial number of processors (i.e., polynomial in the
size  of  the  input)  cannot  be  used  to  produce  efficient 
solutions  to  NP-hard  problems. 

For  NP-hard  problems,  researchers  typically  study 
approximation  algorithms  which  attempt  to  find  a 
nearly  optimal  solution  in  an  acceptable  amount  of 
time.  Recently,  this  study  has  included  algorithms  for 
multiple  processor  machines.  Parallel  approximation 
algorithms  for  the  traveling  salesperson  problem  are 
given in [12, 5], for the 0/1 knapsack problem in [10],
for the 2-dimensional bin-packing problem in [4], and
for the multiprocessor scheduling problem in [3].

This paper focuses on the 2-dimensional bin packing
problem, which is often referred to as the rectangle
packing problem. The 2-dimensional bin packing problem
consists of a set of orthogonal rectangles, with each
rectangle $p_i$ having height $h_i$ and width $w_i$, and a
vertical strip V of width C. The objective is to pack the
rectangles into V so as to minimize the height of V
that is used.

'This work was partially supported by NSF grants IRI-

Two  interpretations  of  2-dimensional  bin  packing 
help  illustrate  the  applicability  of  our  work.  First,  if 
we  do  not  allow  rotations  of  the  rectangles,  then  we 
can  interpret  the  problem  as  minimizing  the  comple¬ 
tion  time  of  a  computational  system  where  rectangles 
correspond  to  program  tasks.  The  height  of  a  rectangle 
corresponds  to  the  amount  of  processing  time  required 
and  the  width  corresponds  to  the  amount  of  memory 
required.  Notice  that  in  this  situation  C  represents 
the  total  memory  of  the  system  that  is  available.  Ob¬ 
viously,  in  this  situation  rotations  do  not  make  sense 
since  memory  cannot  typically  be  traded  for  process¬ 
ing  time.  The  second  application  is  to  stock-cutting 
where  “raw”  material  comes  in  rolls  from  which  we 
wish  to  cut  out  rectangular  patterns.  The  waste  of  the 
raw  material  is  minimized  if  we  minimize  the  length  of 
the  strip  used.  In  this  situation  it  may  be  reason¬ 
able  to  allow  the  rectangles  to  be  rotated  by  ninety 
degrees.  One  common  characteristic  of  these  appli¬ 
cations  is  their  ability  to  be  considered  in  a  dynamic 
sense  in  which  rectangles  are  input  in  a  stream  or  in  a 
static  sense  in  which  we  know  the  entire  rectangle  set 
prior  to  packing.  The  former  case  is  known  as  ‘on-line’ 
packing  while  the  latter  is  known  as  ‘off-line’  packing. 

Due  to  the  economic  importance  of  efficient  stock¬ 
cutting,  a  wide  variety  of  heuristic  methods  have  been 
developed  for  these  problems  over  the  last  20  years. 
(The  reader  is  referred  to  [1]  for  an  excellent  overview 
of  bin-packing  problems.)  These  algorithms  include 
“level” (or “shelf” or “strip”) algorithms, which allow
one  to  apply  knowledge  gained  from  the  1-dimensional 
bin-packing  problem  to  the  2-dimensional  case.  In 
this  paper,  we  propose  a  multi-tiered  algorithm  frame¬ 
work  that  promotes  modularization  for  a  variety  of 
these  level  algorithms.  The  first  tier  of  the  framework 
is  responsible  for  any  preprocessing  of  the  rectangles 
while  the  second  tier  packs  the  rectangles  in  a  sequen¬ 
tial  manner.  The  third  and  final  tier  post-processes 
the  packing  to  improve  the  packing  from  the  second 
tier. Straightforward divide-and-conquer techniques
are  used  to  implement  this  framework  in  a  hypercube 
environment. 
The next section of the paper proposes the tiered
computational framework for our level algorithms.
The two ensuing sections reiterate some of the
conclusions made in [4] and discuss implementation
details on the iPSC/2. The fifth section investigates
enhancements to the algorithms used in [4]. These
enhancements include data structure improvements and
increased attention to the preprocessing step. Finally,
we present our conclusions and some final comments.

Algorithm  Framework 

All of our bin-packing approximation algorithms are
based on the concept of level algorithms [2]. Such level
algorithms, including many of those presented in [1],
can be considered in the following three-tiered
framework:

L1: Preprocess the rectangles.

L2: Pack the rectangles by levels with each rectangle
being placed so that its bottom rests on one of the
levels. The levels are determined by the following
constraints:

•  The bottom of the first level is the bottom
of the vertical strip V.

•  Subsequent levels are determined by a horizontal
cut through the top of the tallest rectangle
in the previous level.

L3: Post-process the resultant packing.

The  objective  in  each  of  the  three  tiers  of  this  frame¬ 
work  is  to  maximize  the  benefits  from  the  time/quality 
tradeoff  perspective.  Results  in  [4]  consider  the  ben¬ 
efits  in  performing  certain  preprocessing  steps.  For 
instance,  heuristics  that  rotate  the  rectangles  prior 
to  packing  improved  packing  efficiency  dramatically 
in  certain  cases  while  impacting  insignificantly  on  the 
running  time.  Likewise,  heuristics  used  in  tiers  L2 
and  L3  impact  upon  the  quality  of  the  final  packing  as 
well  as  the  running  time  of  the  algorithm.  Since  this 
framework  has  been  developed  so  that  we  may  consider 
off-the-shelf  2-dimensional  bin  packing  algorithms,  it 
lends  itself  to  modular  descriptions  of  the  component 
algorithms  used  in  each  tier. 

We now consider each of the three tiers in turn
by describing the possible computations in each tier.
Consider a total of N rectangles as input to any tiered
2-dimensional bin packing algorithm. Let V be a
partition of the N rectangles into P subsets of N/P
rectangles each. For each preprocessing algorithm the
asymptotic running time within each partition of V
will be given as a function of N/P, the number of
rectangles in each class of the partition.


Tier L1: Preprocessing

P1: No preprocessing. These algorithms can be
considered as ‘on-line’ algorithms. Time: $O(1)$.

P2: A sort, keyed by height, in each partition of V.
Time: $\Theta(N/P)$ assuming that the height of the
rectangles is bounded by a constant.

P3: A rotation of the rectangles so that their height
is greater than or equal to their width. Time:
$\Theta(N/P)$.

P4: A rotation of the rectangles so that their width
is greater than or equal to their height. Time:
$\Theta(N/P)$.
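The preprocessing options P2 to P4 can be sketched as follows. The sketch assumes rectangles are `(width, height)` tuples and, for P2, that heights are integers no larger than a known constant, which is one way to obtain the linear running time quoted above (a bucket sort). Sorting into decreasing order and the function names are assumptions made for illustration.

```python
def p2_sort_by_height(rects, max_height):
    """Bucket sort by decreasing integer height in O(n + max_height) time."""
    buckets = [[] for _ in range(max_height + 1)]
    for width, height in rects:
        buckets[height].append((width, height))
    # Read the buckets from tallest to shortest; order within a bucket is kept.
    return [r for h in range(max_height, -1, -1) for r in buckets[h]]

def p3_rotate_tall(rects):
    """Rotate where needed so that height >= width for every rectangle."""
    return [(min(w, h), max(w, h)) for w, h in rects]

def p4_rotate_wide(rects):
    """Rotate where needed so that width >= height for every rectangle."""
    return [(max(w, h), min(w, h)) for w, h in rects]
```

Each routine touches every rectangle once, so applying it independently inside each class of the partition costs time proportional to N/P.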

In the case that the preprocessing occurs ‘off-line’,
the number of alternative preprocessing algorithms is
bounded only by the number of one-to-one and onto
mappings from the input rectangle set onto itself.

Tier  L2:  Packing 

In  this  packing  tier  we  have  considered  three  funda¬ 
mental  algorithms  in  conjunction  with  two  heuristics. 
The  three  level  packing  algorithms  follow.  The  asymp¬ 
totic  running  times  listed  reflect  the  analysis  consid¬ 
ered  in  [4]  where  P  corresponds  to  the  number  of  active 
nodes  and  N  equals  the  total  number  of  rectangles  to 
be  packed. 

A1: Next fit packs rectangles left justified in the
remaining unused width of the current level. If a
rectangle will not fit in the current level then a
new level is initialized with this rectangle and the
packing continues with the new level assuming the
role of the current level. Time: $\Theta(N/P)$.

A2: First fit packs rectangles left justified into the
remaining unused width of the lowest level that they
will fit in. If a rectangle will not fit in any of the
existing levels then a new level is initialized with
this rectangle. Time: $\Theta(N^2/P^2)$.

A3: Best fit packs rectangles left justified into the
remaining unused width of a level they fit in that
minimizes the unused width of all such levels. If
a rectangle will not fit in any of the existing levels
then a new level is initialized with this rectangle.
Time: $\Theta(N^2/P^2)$.
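The three packing rules can be sketched in one routine that differs only in how the target level is chosen. This is a direct, unoptimized rendering of the descriptions above (scanning all levels for first fit and best fit), not the implementation of [4]; rectangles are assumed to be `(width, height)` tuples no wider than C, and all names are illustrative.

```python
def pack_levels(rects, C, rule="next"):
    """Pack rectangles into levels of a strip of width C.

    rule is "next", "first", or "best". Each level records its rectangles,
    its remaining unused width, and its height (the tallest rectangle in it).
    """
    levels = []
    for w, h in rects:
        # Next fit may only use the current (most recent) level.
        candidates = levels[-1:] if rule == "next" else levels
        fitting = [lv for lv in candidates if lv["unused"] >= w]
        if not fitting:
            levels.append({"rects": [(w, h)], "unused": C - w, "height": h})
            continue
        if rule == "best":
            # Fitting level with the least unused width (lowest such level on ties).
            target = min(fitting, key=lambda lv: lv["unused"])
        else:
            # Next fit: the current level. First fit: the lowest fitting level.
            target = fitting[0]
        target["rects"].append((w, h))
        target["unused"] -= w
        target["height"] = max(target["height"], h)
    return levels

def total_height(levels):
    """Height of the strip used: levels are stacked, so their heights add."""
    return sum(lv["height"] for lv in levels)
```

For the rectangles `[(6, 5), (5, 4), (4, 3), (3, 2), (2, 1)]` and C = 10, next fit uses a height of 11 while first fit and best fit both use 9.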

The  two  heuristics  involved  in  the  packing  process 
were  considered  in  [4].  Since  these  methods  depend  on 
both  the  distribution  of  the  data  and  the  complex  rela¬ 
tionships  between  levels,  an  asymptotic  time  analysis 
seems  inappropriate. 
H1: Level Combining. When two levels, call them $L_1$
and $L_2$, are complete we consider combining them
in a simple fashion so as to reduce the total height
taken by the two. This combination procedure
consists of taking one of the levels, say $L_1$,
reversing the rectangles (i.e., the leftmost rectangle
becomes the rightmost, etc.), moving the rectangles
to the top of the level, and then lowering the
newly rearranged $L_1$ on top of $L_2$ as far as possible.

H2: Width Covering/Level Unpacking. When a level
L is complete, we will unpack the rectangles in
L if the unused width of L is greater than some
heuristic factor. All unpacked rectangles will be
repacked at the end of the algorithm by using a
post-processing algorithm.
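The height saved by the level combining heuristic can be computed from the two level profiles. The sketch below is one possible reading of the procedure, under these assumptions: each level is a left-justified list of `(width, height)` tuples, "reversing" means mirroring the level about the centre of the strip so that it becomes right justified, and the reversed level hangs from its top edge. All function names are hypothetical.

```python
def level_spans(level, C, mirrored=False):
    """Return (x_start, x_end, height) for each rectangle of a level."""
    spans, x = [], 0
    for w, h in level:
        spans.append((C - x - w, C - x, h) if mirrored else (x, x + w, h))
        x += w
    return spans

def height_at(spans, x):
    """Height of the rectangle covering position x, or 0 if none does."""
    for start, end, h in spans:
        if start <= x < end:
            return h
    return 0

def combined_height(lower, upper, C):
    """Smallest total height after lowering the mirrored upper level onto the lower one."""
    low = level_spans(lower, C)
    up = level_spans(upper, C, mirrored=True)
    cuts = sorted({0, C} | {edge for a, b, _ in low + up for edge in (a, b)})
    # Between consecutive cuts both profiles are constant; the two levels touch
    # where the sum of the rising and the hanging heights is largest.
    return max(height_at(low, (a + b) / 2) + height_at(up, (a + b) / 2)
               for a, b in zip(cuts, cuts[1:]))
```

With C = 10, a lower level `[(4, 4), (3, 1)]` and an upper level `[(4, 3), (2, 1)]` occupy a height of 7 when simply stacked, but `combined_height` gives 4 because the tall rectangles of the two levels end up over different parts of the strip.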

Tier  L3:  Post-processing 
The  post-processing  tier  has  been  considered  in  two 
possible  ways. 

O1: If there are any rejected rectangles from step L2
these rectangles are accumulated and packed
using some combination of the packing steps in L1
and L2.

O2: No post-processing.

The possible operations that can be used in this tier
are considerably more complex than those in tier L1.
In  fact,  the  set  of  possible  operations  could  consider 
interactions  between  levels  as  well  as  interactions  be¬ 
tween  individual  rectangles. 

Parallel  Realization 

The  framework  presented  above  can  be  applied  when 
using  hypercube  computers.  In  [3],  a  variety  of  hy¬ 
percube  solutions  were  given  for  solving  the  multipro¬ 
cessor  scheduling  problem.  These  algorithms  can  be 
viewed  as  instances  of  a  bottom-up  parallel  divide-and- 
conquer  solution  strategy,  where  initially  each  node 
solves  the  problem  on  its  own  set  of  data,  followed  by 
a  sequence  of  steps  where  these  partial  solutions  are 
combined  to  give  the  complete  solution.  Our  level  al¬ 
gorithms  follow  this  basic  approach.  A  level  algorithm 
template  that  incorporates  the  multi-tiered  framework 
is  given  below.  We  assume  that  given  an  input  set  of 
N  rectangles,  each  of  the  P  nodes  of  the  hypercube 
initially  assumes  responsibility  for  N/P  rectangles. 

Rectangle  Packing 

1.  Preprocessing can occur at a local level when the
partition, V, of the rectangles is the partition
induced by the nodes. Preprocessing can occur at
a global level when we view the partition as
composed of the single set of all N rectangles.

2.  Every  node  uses  a  level  packing  algorithm  possibly 
augmented  by  a  heuristic  to  independently  pack 
its  rectangles  into  a  vertical  strip  of  width  C. 

3.  Recursive  doubling  is  used  to  combine  the  inde¬ 
pendent  solutions  into  a  global  solution.  Post¬ 
processing  heuristics  are  applied  in  this  stage. 
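The combining stage of this template can be sketched as a fan-in over the hypercube dimensions. The simulation below assumes that combining two partial packings simply stacks one strip on top of the other, so that the heights add; post-processing heuristics and the forwarding of unpacked rectangles are left out, and the function name is illustrative.

```python
def recursive_doubling_heights(node_heights):
    """Simulate the fan-in of packing heights on P = 2^d nodes.

    Returns the total height collected at node 0 and the number of
    exchange steps, which equals the cube dimension d.
    """
    p = len(node_heights)
    assert p > 0 and p & (p - 1) == 0, "number of nodes must be a power of two"
    heights = list(node_heights)
    bit, steps = 1, 0
    while bit < p:
        for n in range(0, p, 2 * bit):
            # Node n + bit sends its partial result to its neighbour n across
            # one channel; this is the sender's last operation.
            heights[n] += heights[n + bit]
        bit <<= 1
        steps += 1
    return heights[0], steps
```

For eight nodes the global result is available at node 0 after three exchange steps, one per cube dimension.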

Previous  Results 

Several combinations of the tiers L1, L2, and L3 were
considered in [4]. In the case that rectangle rotations
were allowed (P3 and P4) we concluded that if time is
critical, then

•  the most efficient packing of the rectangles is by
the next fit algorithm, as follows.

-  If the number of rectangles per node is
relatively small (e.g., 8 or fewer), then include
the preprocessing heuristic P4.

-  Otherwise, include the width covering and
level combining heuristics, H1 and H2.

Conversely, if time is not as critical, then

•  the first fit algorithm should be used as follows.

-  If there is a relatively small number of
rectangles per node then widthwise rotating and
the level unpacking heuristic should be used.

-  Otherwise, the heightwise rotation of
rectangles should be used.

Next, we considered the packings for which rectangle
rotations are prohibited. If time is critical, then

•  the most efficient packing of the rectangles is by using
the next fit algorithm with the level unpacking
and level combining heuristics.

Conversely, if time is not as critical, then

•  the first fit algorithm should be used as follows.

-  Given no more than 32 rectangles per node,
include the level unpacking heuristic.

-  For more than 32 rectangles per node, use
the straight first fit algorithm.


Implementation  Details 

Initially, all nodes know the width, C, of the vertical
strip and the initial seed for a random number
generator. When the program begins, the host broadcasts
the total number of rectangles to be packed to every
node. Because all nodes start from the same initial
seed, each node can generate a distinct set of random
rectangles by drawing from its own segment of the
common random number sequence. We use the minimal
standard generator, as described in [11], where if
node i is to generate k rectangles, then it uses 2k
random numbers beginning at random number
2k(i - 1) + 1. We store the rectangles
in a static array that holds the maximum number of
rectangles that will ever be used in the node.
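The per-node use of a single random number stream can be sketched as below. The sketch assumes that the minimal standard generator is the multiplicative congruential generator with multiplier 16807 and modulus 2^31 - 1, that nodes are numbered from 1, and that each rectangle consumes one number for its width and then one for its height, scaled into the strip width. The jump-ahead by modular exponentiation is an illustrative shortcut and is not claimed to be how the original program skipped numbers.

```python
A = 16807          # multiplier of the minimal standard generator
M = 2**31 - 1      # modulus of the minimal standard generator

def node_rectangles(seed, node, k, C):
    """Rectangles of 1-based node `node`, which owns 2k consecutive random numbers.

    Its first number is number 2k(node - 1) + 1 of the stream started at `seed`.
    """
    skip = 2 * k * (node - 1)
    # Advance the seed past the numbers belonging to lower-numbered nodes.
    state = (pow(A, skip, M) * seed) % M
    rects = []
    for _ in range(k):
        state = (A * state) % M
        width = C * state / M      # assumed scaling of a raw number into the strip width
        state = (A * state) % M
        height = C * state / M
        rects.append((width, height))
    return rects
```

Because every node works from the same seed, the rectangles of nodes 1 and 2 taken together are exactly the rectangles a single node would generate with twice the count, which keeps the input set reproducible as the number of active nodes changes.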

After  every  node  has  generated  its  rectangle  set  us¬ 
ing  the  random  number  generator,  the  set  of  active 
nodes  synchronize.  Each  node  continues  by  sampling 
the  clock  and  by  using  a  three-tiered  algorithm  to  pack 
the  rectangles.  Upon  completing  a  level  during  pack¬ 
ing,  the  node  accumulates  the  packing  statistics  for 
that level. When the recursive doubling step is
performed the packing heights from each node are
collected. In addition, if this step is a node's last
operation in the recursive doubling procedure the clock is
sampled again, the running time of this node is
determined and this time is sent to a neighboring node
in the recursive doubling procedure. Of course, when
heuristic  H2  is  used,  the  unpacked  rectangles  are  re¬ 
tained  and  passed  in  the  recursive  doubling  step  along 
with  the  other  relevant  packing  statistics.  The  running 
time  of  an  algorithm  is  simply  the  maximum  running 
time  of  all  nodes. 

Performance  Analysis 

In this section we discuss the performance of
algorithms that extend previous results found in [4]. Since
we use randomly generated rectangles as input, it is
not feasible (in the sense that the problem is NP-hard)
to determine the optimal packing. Therefore,
if an algorithm A packs the rectangles into vertical
strip V (which has width C) using height D, then we
use the percentage of the area CD that the rectangles
cover as a measure of the quality of the solution
produced by A. We ran our algorithms on inputs of size
32, 64, 128, 256, ..., 1048576, using 1, 2, 4, 8, 16, and
32 nodes. Our results consider rectangles with height
and width independent and uniform on (0, C].

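The quality measure is a one-line computation. In the sketch below rectangles are assumed to be `(width, height)` tuples and the function name is illustrative.

```python
def packing_efficiency(rects, C, D):
    """Percentage of the C-by-D strip area that the rectangles cover."""
    covered = sum(w * h for w, h in rects)
    return 100.0 * covered / (C * D)
```

For example, five rectangles with a total area of 70 packed into a strip of width 10 using height 9 give an efficiency of about 77.8%.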
Next  Fit 

The next fit decreasing height algorithm, one
variation of next fit with the P2 heuristic, was previously
explored in [4]. This research considers the next fit
algorithm with the heuristics P2 and O2 where P2 is
implemented as a global sorting routine. This global
sort was realized as a local sort followed by a merging
operation. In the merging operation each node would
route sets of rectangles that are grouped by height to
the nodes responsible for rectangles of those particular
heights. Upon receiving a set of rectangles, the set
is merged in sorted order into the current set of
rectangles on the node. In general, as the same number
of rectangles are spread across more processors this
method performs better than the next fit decreasing
height of [4]. The packing improvement is the highest
when using 32 nodes and is generally more than
1%. On the other hand, this method always performed
at least 5% worse than the next fit decreasing height
algorithm with level unpacking [4] and many times it
packed with 10% less efficiency.
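The local-sort-then-route scheme can be sketched as a single-process simulation. The assignment of heights to nodes is an assumption: the sketch gives each node an equal-width interval of heights, with node 0 owning the tallest interval, which balances the load only when heights are roughly uniform. Rectangles are `(width, height)` tuples with heights in (0, max_height], and the names are illustrative.

```python
import heapq

def global_sort(node_rects, max_height):
    """Sort rectangles by decreasing height across all nodes.

    node_rects[n] is the unsorted rectangle list of node n. The result has the
    same shape; concatenating it in node order gives the global order.
    """
    p = len(node_rects)

    def owner(height):
        # Node responsible for this height (assumed equal-width intervals).
        return min(p - 1, int((max_height - height) * p / max_height))

    def by_height(rect):
        return -rect[1]

    result = [[] for _ in range(p)]
    for rects in node_rects:
        local = sorted(rects, key=by_height)          # local sort on the sending node
        groups = [[] for _ in range(p)]
        for rect in local:                            # group by destination; groups stay sorted
            groups[owner(rect[1])].append(rect)
        for dest, group in enumerate(groups):         # "route" each group and merge on arrival
            result[dest] = list(heapq.merge(result[dest], group, key=by_height))
    return result
```

Each arriving group is already sorted, so the receiving node only needs a linear merge rather than a full sort.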

Interestingly, given a fixed number of rectangles
greater than 1024, the packing efficiency was nearly
identical for every number of active nodes. In fact,
when using P and Q nodes, where P ≠ Q, the packing
efficiency differs by at most 0.3% in every case
where the total number of rectangles is larger than
1024. This phenomenon is presented in Figure 1 and
is explained as follows. The global sort distributes the
rectangles in ordered intervals across all nodes. Given
P nodes and N rectangles, the packing in the P nodes
will differ only slightly from the one in 2P nodes. In
both cases the first N/2 rectangles will be packed
identically. The packing for the second half of the rectangle
set in 2P nodes differs from the packing in P nodes
by at most the height of the level that contains the
(N/2 + 1)-th rectangle. For a large number of sorted
rectangles the height of any two neighbors is nearly
identical and hence the level heights remain nearly the
same. The larger running times for the higher numbers
of nodes are attributed to the overhead associated
with the global sort.

First  Fit 

A major improvement in the computation time of
first fit with local pre-sorting was achieved. We
implemented a static tree structure in which the leaf nodes
were heaps. Each heap represents all of the levels with
a particular amount of unused width. The root of each
heap contains the minimum level number with that
particular amount of space left. The search for the
lowest indexed level that will fit an input rectangle
starts at the lowest indexed leaf node that can
possibly fit the rectangle. By traversing the tree from
leaf level to root level and using information stored
in the tree's nodes the correct level can be identified.
For 32,768 rectangles per node our implementation of
first fit decreasing height with a static tree was about
18.8 times faster than the first fit decreasing algorithm
used in [4]. For smaller numbers of rectangles the
overhead associated with the static tree dominated the
run time. In the worst case, the static tree implementation
was 6.5 times slower than that of [4], with the
static tree using 58 milliseconds and the older
implementation taking 9 milliseconds. This case appeared
for 32 total rectangles on one node.
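A structure of this kind can be sketched as follows. This is an interpretation rather than the authors' data structure: it assumes integer widths between 1 and C, keeps one heap of level numbers per possible unused width as the leaves of a fixed-size binary tree, and stores in every internal node the smallest level number found beneath it. What the original tree stored in its internal nodes is not stated, so that part and all names are assumptions.

```python
import heapq

INF = float("inf")

class StaticTree:
    """Fixed tree over unused widths 0..C; leaf w is a heap of level numbers."""

    def __init__(self, C):
        self.C = C
        self.size = 1
        while self.size < C + 1:
            self.size *= 2
        self.heaps = [[] for _ in range(self.size)]
        self.best = [INF] * (2 * self.size)   # smallest level number in each subtree

    def _refresh(self, unused):
        v = self.size + unused
        self.best[v] = self.heaps[unused][0] if self.heaps[unused] else INF
        v //= 2
        while v >= 1:
            self.best[v] = min(self.best[2 * v], self.best[2 * v + 1])
            v //= 2

    def add(self, level, unused):
        heapq.heappush(self.heaps[unused], level)
        self._refresh(unused)

    def pop_root(self, unused):
        """Remove the lowest level number stored in leaf `unused`."""
        level = heapq.heappop(self.heaps[unused])
        self._refresh(unused)
        return level

    def first_fit(self, width):
        """Lowest level number whose unused width is at least `width`, else None."""
        lo, hi = self.size + width, self.size + self.C
        answer = INF
        while lo <= hi:                 # walk from the leaves towards the root
            if lo % 2 == 1:
                answer = min(answer, self.best[lo])
                lo += 1
            if hi % 2 == 0:
                answer = min(answer, self.best[hi])
                hi -= 1
            lo //= 2
            hi //= 2
        return None if answer == INF else answer

def first_fit_levels(rects, C):
    """First fit with the tree; returns (level heights, unused widths)."""
    tree, heights, unused = StaticTree(C), [], []
    for w, h in rects:
        level = tree.first_fit(w)
        if level is None:
            level = len(heights)
            heights.append(h)
            unused.append(C)
        else:
            # The level found is the smallest in its leaf, hence that heap's root.
            tree.pop_root(unused[level])
            heights[level] = max(heights[level], h)
        unused[level] -= w
        tree.add(level, unused[level])
    return heights, unused
```

Each rectangle then costs a logarithmic number of steps instead of a scan over all existing levels, at the price of building and maintaining the tree, which is consistent with the overhead reported above for very small inputs.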

First fit augmented with a global version of heuristic
P2 was implemented next. Similar effects to those of
next fit with the global presorting were noticed.
However, the effect was not nearly as dramatic since neighboring
rectangles in the sorted order do not necessarily belong
to the same or neighboring levels in the packing. In
addition  to  this  effect  the  preprocessing  also  increased 
the  packing  efficiency  over  that  of  first  fit  decreasing 
height  for  any  given  number  of  rectangles  and  more 
than  one  node.  The  increase  in  packing  efficiency  was 
over  1%  many  times  but  never  more  than  about  2.2%. 
This  effect  is  best  explained  by  the  fact  that  any  given 
node  will  have  a  contiguous  set  of  sorted  rectangles. 
Thus,  in  a  first  fit  packing  less  space  will  be  wasted  by 
placing  relatively  short  rectangles  into  relatively  tall 
levels  since  the  heights  are  closer  in  proximity  to  each 
other  than  in  the  first  fit  decreasing  case. 

Best  Fit 

Best  fit  decreasing  height  had  similarly  striking  im¬ 
provements  in  its  data  structure.  Two  possible  im¬ 
provements  were  investigated.  The  first  improvement 
was  the  use  of  the  static  tree  structure  that  we  used 
for  first  fit.  The  only  difference  in  the  first  fit  and 
best  fit  implementations  is  in  tree  search  methods.  In 
the  best  fit  decreasing  height  implementation  the  low¬ 
est  indexed  leaf  node  that  contains  a  level  with  enough 
space  to  fit  the  next  rectangle  is  chosen.  Therefore,  the 
implementation  turned  out  slightly  faster  than  that  of 
first  fit  in  most  instances  since  the  amount  of  search 
logic  has  been  reduced.  In  the  best  case,  this  imple¬ 
mentation  was  about  167  times  faster  than  the  best  fit 
decreasing height algorithm presented in [4].

The  second  time  saving  measure  was  to  implement 
the  best  fit  decreasing  height  by  using  balanced  search 
trees  [13]  instead  of  static  trees.  The  implementation 
of  this  structure  improved  the  solution  in  two  respects. 
First  of  all,  we  were  no  longer  bound  to  the  imple¬ 
mentation  dependent  static  tree  and  secondly,  the  tree 
traversal  overhead  associated  with  the  static  tree  was 
minimized.  Thus,  this  implementation  of  best  fit  ran 
at  least  9%  faster  than  best  fit  decreasing  height  with 
static trees. The largest speedup was achieved for
smaller numbers of rectangles because the overhead of
the static tree implementation dominated the timing.
Figure  2  presents  the  timings  and  performance  results 
using  a  balanced  search  tree. 
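The best fit query (the open level with the least unused width that still fits the rectangle) can be sketched as follows. This is a minimal sketch under stated assumptions: a sorted Python list handled with `bisect` stands in for the balanced search tree of [13], so the lookup is logarithmic but the list update is linear; only widths are tracked; and the function name `best_fit_levels` is hypothetical. Widths are assumed not to exceed the strip width.

```python
import bisect


def best_fit_levels(widths, strip_width):
    """Assign each width to the open level with the least sufficient unused width."""
    open_levels = []   # sorted list of (unused_width, level_index)
    assignment = []
    next_level = 0
    for w in widths:
        # first entry with unused width >= w; ties go to the lowest level index
        pos = bisect.bisect_left(open_levels, (w, -1))
        if pos == len(open_levels):        # nothing fits: open a new level
            unused, level = strip_width, next_level
            next_level += 1
        else:
            unused, level = open_levels.pop(pos)
        bisect.insort(open_levels, (unused - w, level))
        assignment.append(level)
    return assignment
```

For the widths 5, 7, 3 on a strip of width 10, this places the third rectangle on level 1 (an exact fit), where first fit would have used level 0.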

Comparison 

With respect to the algorithm comparison presented
in [4], several modifications can be made. We are now
in  the  position  to  suggest  the  use  of  a  best  fit  algo¬ 
rithm  in  place  of  the  corresponding  first  fit  algorithm 
for  several  reasons.  As  we  have  pointed  out  earlier, 
the  balanced  search  tree  implementation  is  preferred 
since  it  is  faster,  more  elegant  and  more  flexible  than 
the  static  tree  implementation.  Additionally,  as  we 
mentioned in [4], best fit and first fit are nearly comparable
in their packing efficiency. The time critical
suggestions made in [4] remain valid with the realization
that now there are fast implementations of best
fit.  Application  writers  may  find  that  the  advantages 
of  a  much  better  packing  efficiency  with  a  slower  speed 
outweigh  the  advantages  of  the  fast  but  rather  ineffi¬ 
cient  packing  with  next  fit.  The  last  modification  is 
the  use  of  best  fit  in  the  cases  in  which  P3  and  P4  are 
prohibited.  The  best  packings  in  these  cases  will  be 
given  by  best  fit  augmented  with  the  global  presorting 
heuristic in the case that there are more than 32 rectangles
per node. Otherwise, if there are fewer than 32
rectangles per node, use best fit with heuristic P2.

Final  Remarks 

In  this  paper,  we  considered  a  multi-tiered  framework 
for  2-dimensional  bin  packing  algorithms.  This  three¬ 
tiered  framework  describes  many  level  algorithms  and 
is  extensible  enough  to  include  many  other  algorithms 
not  considered  here.  The  emphasis  in  each  of  these 
three  tiers  is  on  the  optimization  of  the  packing  qual¬ 
ity/packing  time  tradeoff.  The  tiers  describe  modular 
algorithms. 

We included new results to complement those presented
in [4]. All algorithms were implemented for input
sets with between 32 and 1,048,576 rectangles on an
Intel  iPSC/2  with  between  1  and  32  active  nodes.  In 
particular,  we  considered  the  effects  of  global  presort¬ 
ing  as  well  as  improved  data  structures.  Conclusions 
were  drawn  incorporating  the  new  results. 

Our  future  plans  include  the  consideration  of  rela¬ 
tionships  between  the  width  of  the  vertical  packing 
strip  and  the  sizes  of  the  rectangles.  For  example,  it 
may  be  interesting  to  consider  the  heights  and  widths 
of  the  rectangles  chosen  independently  and  uniformly 
on $(0, \delta C]$, for $\delta < 1$. In addition, we also plan
to consider 'on-line' algorithms in depth. These algorithms
typically use the preprocessing heuristic P1.
We  would  also  like  to  study  various  post-processing 
operations  and  the  theoretical  bounds  placed  on  level 
packing  algorithms  by  the  various  tier  components. 
Figure 1: Next fit with a global pre-sort.

Figure 2: Best fit with a balanced search tree. (Plots against the total number of rectangles.)
Abstract

We  develop  an  efficient  subcube  recognition  algorithm 
that  recognizes  all  the  possible  subcubes.  The  algorithm 
is  based  on  exploiting  more  subcubes  at  different  levels  of 
the  buddy  tree.  In  exploiting  the  different  levels,  the 
algorithm  checks  any  subcube  at  most  once.  Moreover, 
many  unavailable  subcubes  are  not  considered  as 
candidates  and  hence  not  checked  for  availability.  This 
makes  the  algorithm  fast  in  recognizing  the  subcubes. 
The  number  of  recognized  subcubes,  for  different  subcube 
sizes,  can  be  easily  adjusted  by  restricting  the  search  level 
down the buddy tree. The previously known algorithms
become special cases of this general approach. When one
level is searched, this algorithm performs as the original
buddy system. When two levels are searched, it will
recognize the same subcubes as the ones in [4], but
faster. When all the levels are searched, a complete
subcube  recognition  is  obtained.  In  a  multi-processing 
system,  each  processor  can  execute  this  algorithm  on  a 
different  tree.  Using  a  given  number  of  processors  in  a 
multi-processing  system,  we  give  a  method  of 
constructing  the  trees  that  maximizes  the  overall  number 
of  recognized  subcubes.  Finally,  we  introduce  an 
allocation  method  "best  fit"  that  reduces  hypercube 
fragmentation.  Simulation  results  and  performance 
comparisons between this method and the traditional "first
fit" are presented.

1.  Introduction 

Hypercube  multiprocessors  have  been  drawing 
considerable  attention  due  to  their  structural  regularity  for 
easy  construction  and  high  potential  for  parallel  execution 
[1-6].  Efficient  allocation  and/or  deallocation  of  node 
processors in the hypercube is a key to its performance
and  utilization.  The  main  objective  is  to  maximize  the 
utilization  of  the  available  resources  as  well  as  minimize 
the  inherent  system  fragmentation.  In  this  paper  we 
introduce  an  efficient  algorithm  to  achieve  this  objective. 
This  problem  has  been  studied  in  [1-4]  using  strategies 
based on buddy and Gray code systems. We propose two
ways to extend the buddy system methodology to
maximize the number of recognized subcubes with low
fragmentation: 1) an extended buddy tree and 2) multiple
extended buddy trees, which are suited to a multi-processing
system. We also introduce a new heuristic for increasing
the availability of the processors.

The  buddy  system  algorithm  searches  for  a  subcube  of 
dimension  k  only  at  level  n-k  of  the  buddy  tree.  The 
extended buddy tree algorithm generalizes the search for
available  subcubes  to  cover  more  than  one  level  down  the 
buddy  tree.  When  the  search  covers  all  the  levels  beyond 
n-k  (i.e.  from  n-k  to  n),  complete  subcube  recognition  can 
be  achieved.  At  every  search  level,  the  proposed 
algorithm  recognizes  the  candidate  subcubes  at  that  level 
only  so  that  subcubes  recognized  at  previous  levels  are  not 
checked  again.  Also,  it  avoids  checking  many  unavailable 
subcubes.  Therefore,  this  algorithm  performs  significantly 
faster  than  exhaustively  searching  all  subcubes  at  that 
level.  The  method  becomes  faster  as  the  number  of 
searched  levels  increases. 

The  multiple  extended  buddy  trees  method  takes  advantage 
of  parallel  computers.  A  set  of  disjoint  trees  are  created 
and  distributed  among  concurrently  running  processors. 
Each  processor  executes  the  extended  buddy  tree  algorithm 
on  its  own  buddy  tree.  It  is  desired  to  have  different 
recognized  subcubes  at  each  processor  to  maximize  the 
total number of recognized subcubes. Chen et al. [1] have
studied the distribution of multiple trees with depth 0
among many processors to maximize the recognized
subcubes. In section 3, we consider the general problem of
distributing the buddy trees with arbitrary depths to the
different processors.

A hypercube is said to be fragmented when there are
enough processors to accommodate a subcube request but
they don't form a subcube. We study an allocation
strategy that increases large subcube availability. When
many candidate subcubes are recognized, the one with the
minimum effect on the larger subcubes is allocated. In
section 4, this (best fit) method is discussed and compared
to the traditional first fit method. Finally, in section 5 we
give some concluding remarks.

We first introduce some notations that are used in this
paper. Let $Q_n$ denote a hypercube of dimension n and $\Sigma$
be the ternary symbol set {0,1,*}, where * is a don't care
symbol. Every subcube $Q_k$ in $Q_n$ can be represented by a
string of symbols in $\Sigma$ having k *'s. For example, the
address of the subcube $Q_2$ formed by nodes 0010, 0011,
0110, 0111 in $Q_4$ is 0*1*. Let a*a* represent the
subcubes 0*0*, 0*1*, 1*0*, and 1*1*. In general, a
string of length n of symbols in {a,*}, where a is 0 or 1,
with k *'s and n-k a's represents $2^{n-k}$ subcubes of size k
since a can be 0 or 1.

2. Extended Buddy Tree

In the buddy system [1], a binary tree of n levels, as
shown in Figure 1 for n = 4, is used to represent the
availability of some subcubes of $Q_n$. In the buddy system,
$Q_k$ subcubes are recognized only at level n-k. There will
be $2^{n-k}$ recognizable subcubes at level n-k, namely
aa...a**...* with n-k a's and k *'s, i.e. the subcubes
$a^{n-k} *^{k}$. For example, the $Q_3$'s at level 1 are 0*** and
1***.

By extending the search to more levels (levels n-k, n-k+1,
... etc.), more subcubes can be recognized. When the next
level is searched, i.e. level n-k+1, then
$\binom{n-k+1}{n-k} 2^{n-k}$
subcubes can be recognized. This is true since any n-k
bits can be chosen among the n-k+1 bits in the tree,
recognizing the following:

    <- n-k+1 ->
    aa...a**...*
    aa...*a*...*
    a...*aa*...*
    *aa...a*...*

i.e. the subcubes { $a^{i} * a^{n-k-i} *^{k-1}$ | $0 \le i \le n-k$ }.

Using a similar argument, searching to depth d yields
$\binom{n-k+d}{n-k} 2^{n-k}$ recognized $Q_k$ subcubes, as stated in
the following Lemma.

Lemma 2.1: Let N(k,d,p) denote the number of
recognizable $Q_k$ subcubes between levels n-k and n-k+d,
i.e. at depths 0 to d, where $0 \le d \le k$, using p trees (in this
section p = 1). Then in a hypercube of dimension n,

    $N(k,d,1) = \binom{n-k+d}{n-k} 2^{n-k}$ for $0 \le d \le k$. □

It is clear from Lemma 2.1 that increasing the search depth
(d) will increase the number of recognizable subcubes.
Let $D_k$ represent the maximum depth used in searching for
$Q_k$ subcubes, where $0 \le D_k \le k$. Also let Time(k,d)
denote the number of comparisons needed to search for a
$Q_k$ subcube using one tree between levels n-k and n-k+d,
where $0 \le d \le k$. Notice that $N(k,k,1) = \binom{n}{k} 2^{n-k}$,
which is all the possible subcubes of size k. The number
of $Q_k$ subcubes recognized at depth d (but not at the
previous depths) is

    $N(k,d,1) - N(k,d-1,1) = \binom{n-k+d-1}{n-k-1} 2^{n-k}$.

The maximum search time occurs when all the
recognizable subcubes are not available and the search is
forced to depth $D_k$. Therefore, Time(k,$D_k$) can be
bounded as shown in the following Lemma.

Lemma 2.2: Given a hypercube of dimension n, the
search time needed to recognize a $Q_k$ subcube using depth
parameter $D_k$ is

    $\mathrm{Time}(k,D_k) \le \binom{n-k+D_k}{n-k} 2^{n-k+D_k}$.

Proof: When the search goes through all depths, i.e.
0, 1, ..., $D_k$, the time will be

    $\mathrm{Time}(k,D_k) \le \sum_{d=0}^{D_k} \bigl( N(k,d,1) - N(k,d-1,1) \bigr) 2^{d}$,

which is bounded by

    $\binom{n-k+D_k}{n-k} 2^{n-k+D_k}$. □


As seen from Lemmas 2.1 and 2.2, both the number of
recognized $Q_k$ subcubes and the maximum search time
increase as the depth parameter $D_k$ increases. Figure 1
shows the increase in both log maximum time and log the
number  of  recognized  subcubes  in  Qg.  In  general, 

table  1  illustrates  the  spectrum  of  the  recognition  and  its 
maximum  time. 


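Depth              Number of recognized $Q_k$'s       Maximum search time                                      Comment
0                  $2^{n-k}$                          $2^{n-k}$                                                Buddy system
(between 0 and 1)  $2^{n-k+1}$                        $2 \cdot 2^{n-k+1}$                                      Gray code
1                  $\binom{n-k+1}{n-k} 2^{n-k}$       $\sum_{d=0}^{1} \binom{n-k+d-1}{n-k-1} 2^{n-k+d}$
$D_k$              $\binom{n-k+D_k}{n-k} 2^{n-k}$     $\sum_{d=0}^{D_k} \binom{n-k+d-1}{n-k-1} 2^{n-k+d}$
k                  $\binom{n}{k} 2^{n-k}$             $\sum_{d=0}^{k} \binom{n-k+d-1}{n-k-1} 2^{n-k+d}$        (full recognition)

Table 1.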

Example 2.1: In a hypercube of dimension 4, let $D_0=0$,
$D_1=0$, $D_2=2$, $D_3=1$, and $D_4=0$. Using the tree shown in
Figure 1, we can recognize $N(k,D_k,1) = \binom{n-k+D_k}{n-k} 2^{n-k}$
$Q_k$'s, for $0 \le k \le n$. That is, we can recognize 16 $Q_0$'s, 8
$Q_1$'s, 24 $Q_2$'s, 4 $Q_3$'s, and 1 $Q_4$. The maximum search
time for recognizing a $Q_2$, for instance, is
$\sum_{d=0}^{2} \binom{1+d}{1} 2^{2+d}$ = 4x1 + 8x2 + 12x4 = 68 node comparisons. □
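The counts in Example 2.1 can be checked numerically. The sketch below assumes the closed form $N(k,d,1) = \binom{n-k+d}{n-k} 2^{n-k}$ of Lemma 2.1 and charges $2^d$ comparisons for every subcube that first becomes recognizable at depth d; the function names `recognized` and `max_search_time` are ours and not part of the paper.

```python
from math import comb


def recognized(n, k, d):
    """N(k,d,1): Q_k subcubes recognizable at depths 0..d with one tree."""
    return comb(n - k + d, n - k) * 2 ** (n - k)


def max_search_time(n, k, depth):
    """Worst-case comparisons when the search is forced through depths 0..depth."""
    total = 0
    for d in range(depth + 1):
        previous = recognized(n, k, d - 1) if d > 0 else 0
        new_at_d = recognized(n, k, d) - previous   # recognized only at depth d
        total += new_at_d * 2 ** d                  # each costs 2^d comparisons
    return total


depths = [0, 0, 2, 1, 0]                            # D_0 .. D_4 of Example 2.1
counts = [recognized(4, k, depths[k]) for k in range(5)]
```

With these definitions `counts` is `[16, 8, 24, 4, 1]` and `max_search_time(4, 2, 2)` is 68, which stays below the Lemma 2.2 bound $\binom{4}{2} 2^{4} = 96$.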

To recognize the $Q_k$ subcubes efficiently at depth d, i.e. at
level n-k+d, we take into account two factors:

1) It is not necessary to check the subcubes that have
been checked at the previous levels.

2) Narrow down the recognition to 'candidate' subcubes
instead of exhaustively checking all possible
$\binom{n-k+d-1}{n-k-1} 2^{n-k}$ subcubes at depth d.

A $Q_k$ can be formed by $2^d$ $Q_{k-d}$ subcubes at level n-k+d.
Any $Q_k$ having an '*' in bit number n-k+d is generated at
some level before n-k+d. At level n-k+d, we only
generate subcubes that have no '*' in bit n-k+d. Only
n-k-1 a's can be chosen among the n-k+d-1 positions to
recognize $\binom{n-k+d-1}{n-k-1} 2^{n-k}$ subcubes at depth d.

The candidate $Q_k$ subcubes at level n-k+d are generated as
follows. Let q be a $Q_{k-d}$ subcube at level n-k+d. If q is
available then all the $Q_k$ subcubes containing q are
considered as candidates. When q is not available all the
$Q_k$ subcubes containing q are not available and hence not
candidates. The other $2^d - 1$ $Q_{k-d}$'s that form with q a
candidate subcube can then be tested for availability.
This process speeds up the search considerably.

More formally, let $S_n$ = { $*^n$ }, $S_{n-1}$ = { $0 *^{n-1}$, $1 *^{n-1}$ },
$S_{n-2}$ = { $00 *^{n-2}$, $01 *^{n-2}$, $10 *^{n-2}$, $11 *^{n-2}$ }, ... etc.
In general let $S_i = a^{n-i} *^{i}$ = { $0^{n-i} *^{i}$, ..., $1^{n-i} *^{i}$ } for
$0 \le i \le n$, i.e. $S_k$ is the set of recognized $Q_k$ subcubes at level
n-k. Let T be the buddy system tree of $Q_n$. The
algorithm is then stated as follows.

Algorithm 2.1: Subcube Recognition.

Given n, k, and $D_k$, the algorithm recognizes
$\binom{n-k+D_k}{n-k} 2^{n-k}$ subcubes of dimension k.

For d = 0 to $D_k$ do
    For each q = $v_1 v_2 v_3 \ldots v_n \in S_{k-d}$ such that T[q] = true
    (i.e. available) do
        1.1 Let Q be the set of all $Q_k$ subcubes that contain q.
        For each subcube p ∈ Q do
            1.2 If the other $2^d - 1$ $Q_{k-d}$'s forming p are available
                then p is available, stop.

The set Q in step 1.1 can be formed by changing any d 0's
to *'s in the first n-k+d-1 positions of q. Notice that if
position n-k+d was included then all previously recognized
subcubes will be formed, and hence this position is
avoided. Since q = $v_1 v_2 v_3 \ldots v_n$ where $v_i$ = '*' for
$n-k+d+1 \le i \le n$, so Q = { p = $u_1 u_2 \ldots u_{n-k+d-1} v_{n-k+d} \ldots v_n$ |
where the number of *'s in p is k [i.e. there are d *'s
in ($u_1 u_2 \ldots u_{n-k+d-1}$)] and $u_i$ = * only if $v_i$ = 0 }. In
general, the d *'s can be chosen in any of the positions
from 1 to n-k+d-1, however when only the positions
containing 0's (or 1's) are considered, every $Q_k$ at this
level will be generated by exactly one $Q_{k-d}$.

In step 1.2, given any subcube p ∈ Q, where p = $u_1 u_2 \ldots
u_{n-k+d-1} v_{n-k+d} *^{k-d}$, the $2^d$ $Q_{k-d}$ subcubes that form
p are obtained by enumerating the d *'s in the first
n-k+d-1 positions of p. These $Q_{k-d}$ subcubes (having the
last k-d positions as *'s) can be directly checked at level
n-k+d of the tree. □
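Algorithm 2.1 can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the tree is the identity tree [1, 2, ..., n], the availability bit T[q] is recomputed from a set `busy` of occupied node addresses instead of being stored in a tree, and the names `prefix_free` and `recognize` are hypothetical.

```python
from itertools import combinations, product


def prefix_free(prefix, n, busy):
    """True if every node whose address starts with `prefix` is free."""
    rest = n - len(prefix)
    return all(prefix + "".join(t) not in busy for t in product("01", repeat=rest))


def recognize(n, k, depth, busy):
    """Return the address of an available Q_k found at depths 0..depth, or None."""
    for d in range(depth + 1):
        level = n - k + d
        for bits in product("01", repeat=level):
            q = "".join(bits)
            if not prefix_free(q, n, busy):
                continue                   # no Q_k containing q can be available
            # step 1.1: star d of the 0's among the first level-1 positions of q
            zeros = [i for i in range(level - 1) if q[i] == "0"]
            for stars in combinations(zeros, d):
                # step 1.2: the 2^d subcubes at this level that together form p
                parts = []
                for fill in product("01", repeat=d):
                    s = list(q)
                    for pos, bit in zip(stars, fill):
                        s[pos] = bit
                    parts.append("".join(s))
                if all(prefix_free(part, n, busy) for part in parts):
                    p = list(q)
                    for pos in stars:
                        p[pos] = "*"
                    return "".join(p) + "*" * (k - d)
    return None
```

With n = 4, k = 2 and only nodes 0010, 0011, 1010 and 1011 free, `recognize(4, 2, 2, busy)` returns `*01*`, while restricting the depth to 0 (the plain buddy system) finds nothing.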

The number of recognized subcubes can be controlled by
setting appropriate values of $D_k$ for different subcube
sizes. For instance, when $D_k = 0$ for $0 \le k \le n$, this
method performs as the original buddy system. When $D_k
= 1$ for $0 \le k \le n$, this method recognizes the same
subcubes as in [4] with greater speed. In this case, this
algorithm performs faster than the one in [4], especially
when the system is highly loaded, since many unavailable
subcubes are not considered as candidates whereas in [4] all
possible subcubes at that level are candidates. The gray
code strategy is somewhere between depths 0 and 1 (more
towards the 0). On the other extreme, when $D_k = k$, for
$0 \le k \le n$, all the possible subcubes can be recognized.

In example 2.1 (using Figure 1), all the 24 $Q_2$'s are
recognized since $D_2=2$, i.e. searching depths 0, 1, and 2.
The following tabulation shows the recognized (but not
necessarily candidate) subcubes at each level of the tree.

d = depth   level   recognized subcubes   # of subcubes recognized only at depth d
0           2       aa**                  4
1           3       a*a* and *aa*         8, with bit 3 ≠ *
2           4       a**a, *a*a, **aa      12, with bit 4 ≠ *
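The tabulation can be reproduced by enumerating, for each depth, the patterns whose last 'a' sits exactly at position n-k+d. The sketch below assumes the identity tree [1, ..., n]; the function name `new_patterns` is ours.

```python
from itertools import combinations


def new_patterns(n, k, d):
    """Patterns (n-k a's, k *'s) that first become recognizable at depth d."""
    if k == n:                             # the whole cube has a single pattern
        return ["*" * n] if d == 0 else []
    level = n - k + d
    patterns = []
    # position `level` holds an 'a'; the other n-k-1 a's go in earlier positions
    for rest in combinations(range(level - 1), n - k - 1):
        fixed = set(rest) | {level - 1}
        patterns.append("".join("a" if i in fixed else "*" for i in range(n)))
    return patterns


table = [new_patterns(4, 2, d) for d in range(3)]
```

Each pattern stands for $2^{n-k} = 4$ subcubes, so the three rows account for 4, 8 and 12 subcubes, 24 in total.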


Example 2.2: Consider the hypercube of dimension 4
represented by the tree T in Figure 1 where the dark
subcubes are occupied and the light ones are available.
The following sequence illustrates how algorithm 2.1
proceeds to recognize a $Q_2$ when $D_2 = 2$.

When d=0, the subcubes 00**, 01**, 10**, 11** are
considered but they are all not available. When d=1,
since q = 000* is busy the subcubes 0*0* and *00* are
skipped. Since q = 001* is available we consider the
candidate set Q = {0*1*, *01*}. 0*1* = 001* and 011*
but 011* is not available so 0*1* is not. *01* = 001*
and 101* and both are available so *01* is available. □

3. Multiple Extended Buddy Trees

In this section we study the performance of the extended
buddy system when multiple trees are employed. In a
multi-processing system these trees can be assigned to
different processors to speed up the recognition process.

Let [$a_1, a_2, \ldots, a_n$] denote a tree that splits at the first
level according to bit number $a_1$, then at the second level
according to bit $a_2$, ... and so on. Every bit position
appears exactly once, i.e. { $a_1, a_2, \ldots, a_n$ } = { 1, 2, ..., n }.
The tree shown in Figure 1, say $T_1$, is a [1,2,3,4] tree.
Figure 2 shows another tree $T_2$ = [4,3,2,1].

Recall that, as in Lemma 2.1, given a tree T = [$a_1, a_2, \ldots,
a_n$], the number of recognized $Q_k$'s up to depth d is
$\binom{n-k+d}{n-k} 2^{n-k}$. These subcubes are of the form V = $v_1 v_2 \ldots v_n$
such that V has n-k a's and k *'s and the a's can only
appear in the positions $a_1, a_2, \ldots,$ or $a_{n-k+d}$.

Let $C_1$ and $C_2$ be the sets of the recognized subcubes
using $T_1$ = [$a_1, a_2, \ldots, a_n$] and $T_2$ = [$b_1, b_2, \ldots, b_n$] to
depth d, respectively. $C_1$ and $C_2$ are distinct, i.e. $C_1 \cap
C_2 = \emptyset$, iff $|\{a_1, a_2, \ldots, a_{n-k+d}\} \cap \{b_1, b_2, \ldots, b_{n-k+d}\}|
< n-k$. This is true since the n-k a's in $C_1$ can never be
in the same positions as the n-k a's in $C_2$.

For example, consider $T_1$ = [1,2,3,4] in Figure 1. When
$D_3=1$, the recognized $Q_3$'s at levels 1 and 2 are a*** and
*a**, and when $T_2$ = [4,3,2,1] in Figure 2 is used, the
recognized $Q_3$'s are **a* and ***a. The subcubes are
disjoint since |{1,2} ∩ {4,3}| < n-k = 1.

Distinct Trees Matrix (DTM):


A DTM for a hypercube of dimension n represented by m
trees is defined to be an m x n matrix, where each row
represents a different tree. Let

$\mathrm{DTM} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & & & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{pmatrix}$

where row i is a [$x_{i1}, x_{i2}, \ldots, x_{in}$] tree, therefore

    $\{ x_{i1}, x_{i2}, \ldots, x_{in} \} = Z_n = \{ 1, 2, \ldots, n \}$ for $1 \le i \le m$    (3.1)

A matrix M is considered a DTM if for any i distinct
numbers from $Z_n$ there is at least one row in M such that
these i numbers appear in the first $i + Y_{n-i}$ positions in
that row. When M has few rows, not all the permutations
are achievable. In order to maximize the total number of
recognized $Q_k$'s, the DTM must have the maximum
possible permutations. More formally, let $S_{k,j}$ be the set
of all possible subsets of { $x_{j1}, x_{j2}, \ldots, x_{j,(n-k)+Y_k}$ } of
size n-k. Let $C_k = \bigcup \{ S_{k,j} \mid 1 \le j \le m \}$, i.e. $C_k$ represents
the set of the recognized $Q_k$ subcubes. In the construction
of the DTM, the $|C_k|$'s must be maximized, so a DTM
must satisfy the following property.

    $|C_k| = \min\left( m \binom{n-k+Y_k}{n-k},\ \binom{n}{k} \right)$ for $0 \le k \le n$    (3.2)

Example 3.1: Let n = 4, m = 6, and $Y_k=0$ for all k.
The following 6x4 matrix is a DTM

    1 2 3 4
    2 3 4 1
    3 4 1 2
    4 1 2 3
    1 3 2 4
    2 4 1 3

Using the 6 trees (with depth 0) given in the above DTM
allows us to recognize all possible subcubes in a
hypercube of dimension 4. For example all the $Q_2$
subcubes are recognized because in columns 1 and 2 all
the possible combinations exist. For instance, the a**a
$Q_2$ subcubes are recognized from the [4,1,2,3] tree. □
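The claim of Example 3.1 can be checked mechanically by collecting, for every row, the sets of n-k bit positions that fall within its first n-k+depth entries. This is a minimal sketch; the function name `recognized_position_sets` is an assumption, and each position set stands for the $2^{n-k}$ subcubes obtained by fixing those bits.

```python
from itertools import combinations
from math import comb


def recognized_position_sets(dtm, k, depth):
    """Sets of n-k 'a' positions found within the first n-k+depth entries of a row."""
    n = len(dtm[0])
    found = set()
    for row in dtm:
        for subset in combinations(row[: n - k + depth], n - k):
            found.add(frozenset(subset))
    return found


dtm = [
    [1, 2, 3, 4],
    [2, 3, 4, 1],
    [3, 4, 1, 2],
    [4, 1, 2, 3],
    [1, 3, 2, 4],
    [2, 4, 1, 3],
]
complete = all(
    len(recognized_position_sets(dtm, k, 0)) == comb(4, k) for k in range(5)
)
```

Here `complete` is true: at depth 0 the six trees cover every choice of fixed positions for every subcube size, including {1, 4} for the a**a subcubes.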

Example 3.2: Let n = 6, m = 3, $Y_k = 1$ for $0 \le k \le 6$.
Then the following matrix is a DTM.

$M = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 \\ 3 & 4 & 5 & 6 & 1 & 2 \\ 5 & 6 & 1 & 2 & 3 & 4 \end{pmatrix}$

The trees in M give the maximum recognized subcubes.
In this case, all the subcubes of size 0, 1, 5, and 6 are
recognized, and the recognized subcubes of size 2, 3, and 4
are maximized. For instance, the **a*a* $Q_4$ subcubes
are recognized from the [3,4,5,6,1,2] tree since {3,5}
appear in the first $n-k+Y_k = 3$ positions of this tree. □


Let Y = max { $Y_k$ | $0 \le k \le n$ }. A DTM of $\lfloor n/(Y+1) \rfloor$ or
less rows can be formed by letting the first row be [1, 2,
..., n] and row i be a Y+1 left circular shift of row i-1, for
$1 < i \le \lfloor n/(Y+1) \rfloor$. The matrix M in the previous example is
generated using this method. When all the depths are 0
then a DTM with any number of rows can be constructed
using matching theory [1]. In the presence of a
multi-processing system with p processors, a DTM with
p rows and n columns provides the best tree to processor
assignment. That is, processor i performs the search
using the tree in the i-th row of the DTM, for $1 \le i \le p$.
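The circular shift construction is short enough to write out. The sketch assumes the row count is the floor of n/(Y+1), which is our reading of the garbled bound and agrees with the three rows of Example 3.2; the function name `shifted_dtm` is hypothetical.

```python
def shifted_dtm(n, max_depth):
    """Rows formed by repeated left circular shifts of [1..n] by max_depth + 1."""
    step = max_depth + 1
    row = list(range(1, n + 1))
    rows = []
    for _ in range(n // step):
        rows.append(row)
        row = row[step:] + row[:step]      # left circular shift by `step`
    return rows
```

For n = 6 and Y = 1 this returns the matrix M of Example 3.2, and for n = 4 and Y = 0 it returns the first four rows of the matrix in Example 3.1.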

Lemma 3.1: Given p processors, let M be a p x n DTM.
The maximal subcube recognition is achieved by
assigning the p trees of M to the p processors. And,

    $N(k,Y_k,p) = \min\left( \binom{n}{k} 2^{n-k},\ p \binom{n-k+Y_k}{n-k} 2^{n-k} \right)$

where $N(k,Y_k,p)$ is the total number of the recognized
$Q_k$'s up to depth $Y_k$ in the p trees of M. □

As seen from Lemma 3.1, when each processor in the
parallel system is assigned a different tree, the number of
the recognized $Q_k$'s is linearly dependent on the number of
processors.

Lemma 3.2: Given a p x n DTM, M, and $Y_k$ for
$0 \le k \le n$. Let $R_k$ denote the number of the recognized $Q_k$'s,
where $0 \le R_k \le \binom{n}{k} 2^{n-k}$, then

    $p \ge \dfrac{R_k}{\binom{n-k+Y_k}{n-k} 2^{n-k}}$ for $0 \le k \le n$.

Proof: From lemma 3.1, any tree can recognize
$\binom{n-k+Y_k}{n-k} 2^{n-k}$ $Q_k$ subcubes when searched to depth $Y_k$. Therefore,
all the trees in the DTM can recognize at most
$p \binom{n-k+Y_k}{n-k} 2^{n-k}$ $Q_k$ subcubes when searched up to depth $Y_k$, i.e.
$R_k \le p \binom{n-k+Y_k}{n-k} 2^{n-k}$. □

Consider the following special cases of Lemma 3.2.

1) Suppose that $R_k = \binom{n}{k} 2^{n-k}$ and $Y_k = 0$,
i.e. complete recognition with zero depth. Then,
$p \ge \binom{n}{k}$ for $0 \le k \le n$. The
maximum occurs when k = n/2, in which $p \ge \binom{n}{n/2}$.

2) Suppose that $R_k = \binom{n}{k} 2^{n-k}$ and $Y_k = k$.
In this case $p \ge 1$ for $0 \le k \le n$. This confirms that a
single tree can recognize all the subcubes when searched
up to depth k.

Figure 3 illustrates, for n=8 and k=6, the relation between
the search depth in each tree and the number of required
trees needed to recognize all the $\binom{8}{6} 2^{8-6}$ $Q_6$ subcubes
in $Q_8$. The figure also shows the log of the maximum
search time for the different depths.
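A lower bound on the number of trees for complete recognition follows from Lemma 3.2 by setting $R_k = \binom{n}{k} 2^{n-k}$, which gives $p \ge \binom{n}{k} / \binom{n-k+Y_k}{n-k}$. The sketch below evaluates this for n = 8 and k = 6. It is a lower bound only: the values plotted in Figure 3 are not reproduced here, and whether the bound is attained at every depth is not claimed. The function name `min_trees` is ours.

```python
from math import comb


def min_trees(n, k, depth):
    """Lower bound on the number of trees for complete Q_k recognition."""
    per_tree = comb(n - k + depth, n - k)  # position sets one tree can cover
    return -(-comb(n, k) // per_tree)      # ceiling division


bounds = [min_trees(8, 6, depth) for depth in range(7)]
```

The bound falls from 28 trees at depth 0 to a single tree at depth 6, in line with the two special cases of Lemma 3.2.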

As seen in section 2, a uni-processor can recognize
$N(k,D_k,1)$ $Q_k$'s in
$\sum_{d=0}^{D_k} \binom{n-k+d-1}{n-k-1} 2^{n-k+d}$ time units. Using p processors,
the $N(k,D_k,1)$ $Q_k$'s can be recognized much faster since
the search depth ($Y_k$) in each processor is smaller than
$D_k$. To recognize $N(k,D_k,1)$ $Q_k$'s using p trees, the trees
must be searched up to depth $Y_k$ where $p \cdot N(k,Y_k,1) \ge
N(k,D_k,1)$. Using the approximation
$\binom{n-k+Y_k}{n-k} \approx \left(\frac{n}{k}\right)^{Y_k - D_k} \binom{n-k+D_k}{n-k}$,

    $Y_k \approx D_k - \dfrac{\log p}{\log n - \log k}$    (3.2.1)

Let $T_1$ be the maximum time to recognize all possible $Q_k$
subcubes in $Q_n$ with one processor. Let $T_p$ be the time
to recognize all possible $Q_k$ subcubes in $Q_n$ using the
best parallel algorithm on a parallel system of p
processors, so $T_p = T_1 / p$. Let $T_p^*$ denote the maximum
time to recognize all possible $Q_k$ subcubes using the
multiple trees method with p processors. Let improve(p)
$= T_p / T_p^*$; then

    improve(p) $\ge 2^{h}$, where $h = \dfrac{\log p}{\log n - \log k}$    (3.2.2)

This can be proved as follows. The max time to
recognize all $Q_k$'s using one tree is $T_1 = \binom{n}{k} 2^{n-k} 2^{k}$
and hence $T_p = \binom{n}{k} 2^{n} / p$. When p trees are used then
$T_p^* = \binom{n-k+Y}{n-k} 2^{n-k} 2^{Y}$ where $Y = k - \dfrac{\log p}{\log n - \log k}$,
so improve(p) $\ge 2^{h}$ where $h = \dfrac{\log p}{\log n - \log k}$.

The improvement is mainly due to the usage of the
distinct multiple trees in $T_p^*$. So it is faster to distribute
the trees among the processors rather than distributing the
original function.

Also, notice that more than one tree can be used at each
processor, and so all these distinct multiple trees can run
on one processor. This will yield a faster subcube
recognition; however, the major drawback of this method,
other than increased memory, is the time taken to update
the trees after each subcube allocation.

4. Subcubes Recognition with High
Availability (Low Fragmentation).

In sections 2 and 3 the emphasis was on the number
of recognized subcubes. When low fragmentation is
desired, an allocation strategy that chooses among the
recognized subcubes must be employed. This problem is
similar to the traditional memory system where the
memory is allocated based on some strategy, say first fit,
best fit, worst fit, etc. The fragmentation problem also
extends to the deallocation, i.e. when a subcube becomes
available. In this case one can also rearrange the
processor-task mapping to minimize fragmentation. This
is somewhat similar to compacting the memory.

The current allocation methods can be considered as first
fit since the first available subcube is chosen for
allocation. We propose a new method similar to the best
fit in memory systems. This method chooses the
subcube, among all available subcubes, that leaves the
maximal unfragmented system. Simulation results and
performance comparisons between this method and the
"first fit" will be presented. We start with some
definitions.
Let A = <$a_n, a_{n-1}, \ldots, a_0$> denote the subcube
availability vector, where $a_k$ is the number of available
(and recognizable) $Q_k$ subcubes. We call A the "state" of
the system. The initial state is <$N(n,D_n,1)$,
$N(n-1,D_{n-1},1)$, ..., $N(0,D_0,1)$>. Let A = <$a_n, a_{n-1}, \ldots,
a_0$> and B = <$b_n, b_{n-1}, \ldots, b_0$> be two states and let j be
the largest integer ($0 \le j \le n$) where $a_j \ne b_j$. We say that A is
less fragmented than B iff $a_j > b_j$, i.e. lexicographic order.
This metric gives more weight to larger subcubes than
smaller ones.

Let q be a recognizable and available subcube and let $L_q$ =
<$l_n, l_{n-1}, \ldots, l_0$> be the loss vector resulting from
allocating q. The loss vector implies that $l_i$ $Q_i$'s, for
$0 \le i \le n$, recognizable subcubes are made unavailable as a
result of allocating q. The new state after allocating q will
be:

    new state = current state - $L_q$

In order to achieve less fragmentation (high availability)
when allocating a new subcube, we choose the one that
causes the minimum loss to large subcubes in the
allocation process. This is illustrated in the following
algorithm.

Algorithm 4.1: Best Fit Allocation (using a single
tree).

o Let S be the set of all the recognized and available $Q_k$'s.
o Let q ∈ S be the subcube such that $L_q \le L_p$ for any p ∈ S.
o Allocate q (if any).
o The tree and the state are updated accordingly. □

In the above algorithm, when all the depths are restricted
to zero, $L_q$ can be easily calculated by counting the
number of $Q_k$'s that become unavailable at level n-k as a
result of allocating q. When q is of dimension k, $L_q$ will
be of the form <0, ..., 0, 1, ..., 1, 2, 4, ..., $2^{k-1}$, $2^{k}$>. So
$L_q$ can be determined by the largest lost subcube
corresponding to the first "1" in $L_q$. When the depths are
arbitrary, the computation of $L_q$ might take longer time.

A  simulation  was  performed  to  analyze  the  new  method. 
The  following  parameters  were  varied:  hypercube  sizes, 
load factors, and the sizes of the requested subcubes. For
the latter, the uniform and geometric distributions were
used. Figure 4 is for Qg with system load between 80%
and 100%. The geometric distribution was used to generate
the  size  of  the  requested  subcubes.  At  every  time  unit  a 
random  subcube  is  chosen  from  the  allocated  subcubes  to 
be  released.  This  gives  a  semi-exponential  distribution  for 
their  execution  time.  The  simulation  was  run  for  1000 
time  units,  after  reaching  a  90%  load  factor.  Figure  4 
suggests that for larger subcube sizes the best fit method
performs considerably better than the first fit method.

When multiple trees are used, let $A_i$ be the state of the
system using the tree at processor i, for $1 \le i \le p$. The
system (global) state is then defined as the maximum
(element wise) of the $A_i$'s. In this case, Algorithm 4.1 is
executed at each processor. Processor i then sends its new
state $A_i$ and its candidate subcube $q_i$, i.e. the one
with the lowest loss, to the host. The "host" collects this information
and chooses the best among the $q_i$'s. Algorithm 4.1 can be
modified for multiple processors (trees) as follows.

Algorithm 4.2: To recognize a $Q_k$ subcube with low
fragmentation using p trees.

o Let $q_i$ and $A_i$ be the candidate subcube and the new
state, respectively, of processor i.
o Processor i sends its $q_i$ and $A_i$ to the host.
o The host computes m such that $A_m \ge A_i$ for $1 \le i \le p$.
o Allocate $q_m$ (if any).
o Each processor updates its tree and state accordingly. □
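
The host's choice in Algorithm 4.2 can be sketched in the same style. This is an illustrative sketch under the assumption that each processor reports a pair (candidate, new state), with the state stored as a tuple ordered from the largest dimension down, and that a processor without a candidate reports `None`. The name `host_select` and the report format are hypothetical.

```python
def host_select(reports):
    """Choose the processor whose resulting state is least fragmented.

    `reports` is a list of (candidate, new_state) pairs, one per processor.
    Returns (m, candidate) for the winning processor, or (None, None)
    if no processor proposed a candidate.
    """
    best = None
    for i, (candidate, state) in enumerate(reports):
        if candidate is None:
            continue  # this processor found no allocatable subcube
        # tuple comparison is lexicographic, matching the state ordering
        if best is None or state > reports[best][1]:
            best = i
    if best is None:
        return None, None
    return best, reports[best][0]
```

After the host announces the winner, every processor would apply the same allocation to its own tree so that all trees stay consistent.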

To  completely  remove  system  fragmentation,  deallocation 
must  be  considered.  When  a  task  releases  a  given 
subcube,  the  cube  might  become  fragmented.  In  this  case 
one  can  rearrange  the  task-processors  allocation  to  remove 
this  fragmentation.  This  process,  referred  to  as  task 
migration  [3],  has  a  high  overhead  since  it  requires  task 
deallocation  and  allocation.  Task  migration  can  be  done  at 
every deallocation, if fragmentation exists, to maintain an
unfragmented system. However, when the system is
highly available it may not be worthwhile. The other
approach to this problem is to compact the whole system
when the fragmentation exceeds some threshold.
Figure 4. Hit ratio for different sizes of subcubes

Figure  2:  A  [4, 3,2,1]  tree-  allocation  using  best  fit  and  first  fit  allocation. 
Abstract

A  parallel  thinning  algorithm  based  on  boundary 
following  is  presented  in  this  paper.  The  boundary  of 
each object region is extracted and linked in parallel.
The resulting object boundary data is divided based on
the object size and the number of nodes for load
balancing, then the divided objects are redistributed to
the nodes. Each boundary in a node is projected on a
"working plane". Next, the boundary data is repeatedly
shrunk until only the skeleton of the region remains.
The  conventional  iterative  parallel  algorithm  as  well  as 
our  new  algorithm  are  implemented  on  a  hypercube- 
topology  multiprocessor  computer,  the  Intel  iPSC/2.  The 
two  algorithms  are  compared  and  analyzed.  Some 
resulting  figures  and  execution  times  are  presented. 


1. Introduction

Thinning  is  one  of  the  most  important  procedures  in 
pattern  recognition  and  image  data  reduction,  but  it  is 
a  very  time  consuming  procedure.  In  most  existing 
thinning  algorithms,  several  templates  (usually  3x3)  are 
scanned  on  the  image  for  deleting  boundary  points  but 
not  the  skeleton  of  an  object.  This  procedure  is 
repeated until no points are deleted. Obviously the
complexity  of  the  whole  procedure  is  O(n^)  [1].  In 
order to reduce processing time, many parallel algorithms
[2-7] have been proposed which can be easily implemented
on currently available mesh computers. Unfortunately,
those parallel thinning algorithms are undesirable for
implementation on distributed memory computers because
the global shapes of the objects in an image might be
affected when the image is divided and distributed to
each node. To avoid this problem, data swapping between
nodes, that is, communication, must be performed at
every iteration [8].

In  this  paper,  we  propose  a  new  parallel  thinning 
algorithm  based  on  boundary  following  and  shrinking. 
Each  object  boundary  in  the  image  is  extracted  and 
linked in parallel. The number of objects is divided
based on the number of nodes and object size. Then the
objects are distributed to the nodes and thinned in
parallel by following the boundaries and shrinking them
in the direction perpendicular to the boundary and
pointing toward the inside of the object. This
algorithm  reduces  the  complexity  from  O(n^)  to  0(n^}. 


2.  Parallel  Boundary  Detection 
and  Object  Extraction 

To  avoid  the  problem  of  breaking  the  objects  between 
nodes,  we  extract  the  objects  and  thin  them  individually 
in  each  node  in  parallel.  In  this  section  we  introduce 
a  parallel  boundary  detection  and  object  extraction 
method  on  a  distributed  memory  computer. 

2.1  Input  Image  Distribution 
Method 

Poor performance can result if processor loading is
uneven. To maximize performance, the amount of data
loaded to each node must be balanced. A distribution
method commonly used in parallel image processing divides
the image plane into rows as evenly as possible according
to the number of nodes being used, and distributes each
resulting sub-image to one node. In theory, this
distribution method reduces the processing time to
O(1/N) of the single-node time, where N is the number of
nodes being used. Some rows on the borders of each node
need to be shared with its adjacent nodes for detecting
boundaries by the template matching method
described in the next section. The number of shared
rows depends on the number of rows of the template; r/2
rows need to be shared, where "r" is the number of rows
of the template used.
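
The row distribution with shared border rows might be computed as in the sketch below. It is an illustration only: it assumes that "r/2" means integer division (one shared row for a 3x3 template) and that row ranges are half-open `(start, stop)` pairs. The function name and the dictionary keys are hypothetical.

```python
def partition_rows(n_rows, n_nodes, template_rows=3):
    """Split image rows as evenly as possible and add shared border rows.

    Returns one dict per node with the rows it owns and the slightly
    larger range it must hold so the template can be applied at its borders.
    """
    halo = template_rows // 2          # assumed reading of "r/2"
    base, extra = divmod(n_rows, n_nodes)
    parts = []
    start = 0
    for node in range(n_nodes):
        # the first `extra` nodes take one additional row each
        count = base + (1 if node < extra else 0)
        stop = start + count
        lo = max(0, start - halo)      # clip at the top of the image
        hi = min(n_rows, stop + halo)  # clip at the bottom of the image
        parts.append({"own": (start, stop), "with_shared": (lo, hi)})
        start = stop
    return parts
```

For a 512-row image on 4 nodes, each node owns 128 rows, and an interior node such as node 1 additionally holds one row from each neighbor.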


2.2  Boundary  Detection 

Let  us  assume  that  we  have  a  digitized  binary  image. 
Then  an  object  region  in  the  image  can  be  simply 
represented  by  the  set  of  boundary  points,  or  connected 
edge  points,  of  the  object.  The  connectedness  of  two 
boundary  points  in  a  binary  image  depends  on  the 
definition of the neighborhood: four-neighbor ($N_4$) or
eight-neighbor ($N_8$) [9]. The $N_4$ and $N_8$ neighborhoods of a
point at (i,j) consist of the following points:

$N_4 = \{(i,j-1),\ (i,j+1),\ (i-1,j),\ (i+1,j)\}$ (1)

$N_8 = N_4 \cup \{(i-1,j-1),\ (i-1,j+1),\ (i+1,j-1),\ (i+1,j+1)\}$ (2)

In  this  paper  we  have  chosen  eight-neighbor  as  the 
definition of the neighborhood. Two points are eight-
connected if they are eight-neighbors of each other.
Figure 1 shows a 3x3 template according to the eight-
neighbor definition.

Let us assume that the size of the input binary image
is nxn, and no object points touch the periphery. Then
the boundaries can be detected by the following
procedure:

procedure Boundary_Detection;
begin
  for j=0 to n-1 do begin
    for i=0 to n-1 do begin
      if I(i,j)=1 and I(p,q)=1 for all (p,q) in N_8(i,j)
        then B(i,j)=0
        else B(i,j)=I(i,j);
    end;
  end;
end;

where  I  is  an  array  representing  the  input  image,  and  B 
is  an  array  representing  the  output  which  contains  only 
boundary  points.  This  procedure  is  performed  in  each 
node for its sub-image in parallel.
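
A runnable version of the boundary-detection step is sketched below. It rests on one reading of the procedure above, namely that an object point is interior (and therefore dropped) when all eight of its neighbors are also object points, and that every other object point is a boundary point. The image is a plain list of rows, and neighbors outside the array are treated as background. The names are hypothetical.

```python
# Offsets (di, dj) of the eight neighbors of a point (i, j).
N8_OFFSETS = [(-1, -1), (0, -1), (1, -1),
              (-1, 0),           (1, 0),
              (-1, 1),  (0, 1),  (1, 1)]


def detect_boundary(image):
    """Return a binary image that keeps only the boundary points.

    `image[j][i]` is 1 for object points and 0 for background.
    """
    n_rows, n_cols = len(image), len(image[0])
    boundary = [[0] * n_cols for _ in range(n_rows)]
    for j in range(n_rows):
        for i in range(n_cols):
            if image[j][i] != 1:
                continue
            # interior: every eight-neighbor exists and is an object point
            interior = all(
                0 <= j + dj < n_rows
                and 0 <= i + di < n_cols
                and image[j + dj][i + di] == 1
                for di, dj in N8_OFFSETS
            )
            if not interior:
                boundary[j][i] = 1
    return boundary
```

In the distributed setting each node would run this on its own sub-image, including the shared border rows, so that points next to a partition border see their true neighbors.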


P8 (i-1,j-1)   P1 (i,j-1)   P2 (i+1,j-1)
P7 (i-1,j)     P0 (i,j)     P3 (i+1,j)
P6 (i-1,j+1)   P5 (i,j+1)   P4 (i+1,j+1)

In order to link the local objects in the nodes
globally, the local objects which have top border
elements or bottom border elements must be sent to their
adjacent nodes and linked one after the other for all
nodes. Since this step is a kind of sequential top-down
linking procedure and it involves an exhaustive
comparison between every pair of 'bbe' and 'tbe', it
might cause some degradation in the parallelism. To
maximize the parallelism, we have proposed a parallel
linking algorithm in [10]. The theoretical speed-up is
$\log_2 N$, where N is the number of nodes being used.


Figure 2. Boundaries (bound), bottom border elements
(bbe), and top border elements (tbe), shown for the
sub-images held by Node 0 through Node N-1.


3.  Parallel  Object  Thinning 


Figure  1.  3x3  template  according  to  the  eight- 
neighbor  definition 


2.3  Object  Extraction 

As the result of the boundary detection, each node
has boundary points only for its own sub-image. Now, we
extract the objects locally by the boundary-following
method  which  is  described  in  [9]  (see  ch.A).  The  data 
structure  for  an  object  might  be  the  following: 

typedef struct {
    int   numofbound,  /* number of boundary points         */
          numoftbe,    /* number of points on top border    */
          numofbbe;    /* number of points on bottom border */
    XYCRO *bound,      /* pointer for boundary data         */
          *tbe,        /* pointer for top border elements   */
          *bbe;        /* pointer for bottom border elements */
} OBJECT;

where  XYCRO  is  another  data  structure  for  a  data 
position.  Figure  2  illustrates  the  'bound',  'tbe',  and 
'bbe'. 


For load balancing, the root node collects the
information on the number of objects and their sizes
from all nodes. Then the root node divides the number
of objects according to the number of nodes being used
and the object sizes and redistributes the objects to
all nodes. In the global object-linking step described
in the previous section, the boundary data of the
objects might be shuffled. To rearrange the data, we
project the objects on a 'working plane' and perform the
boundary-following step once again.
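
The text does not say how the root node divides the objects, so the following is only one plausible sketch: a greedy rule that hands the largest remaining object to the currently least-loaded node. The greedy rule, the function name, and the use of boundary-point counts as "size" are all assumptions made for illustration.

```python
import heapq


def assign_objects(sizes, n_nodes):
    """Greedily assign objects to nodes so that total sizes stay balanced.

    `sizes[k]` is the size of object k (for instance its number of
    boundary points). Returns a list with the object indices per node.
    """
    heap = [(0, node) for node in range(n_nodes)]  # (current load, node id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_nodes)]
    # largest objects first, each to the least-loaded node so far
    for obj in sorted(range(len(sizes)), key=lambda k: sizes[k], reverse=True):
        load, node = heapq.heappop(heap)
        assignment[node].append(obj)
        heapq.heappush(heap, (load + sizes[obj], node))
    return assignment
```

With sizes `[10, 7, 5, 4, 4]` on two nodes this yields loads of 14 and 16. When there are fewer objects than nodes, some nodes receive nothing under this rule.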

Now,  we  thin  the  objects  in  each  node  in  parallel  by 
following  the  object  boundaries  clockwise  and  by 
shrinking  them  in  the  direction  perpendicular  to  the 
boundary  and  pointing  toward  the  inside  of  the  object. 
This procedure is repeated until the number of boundary
points no longer changes. Note that we find the direction
for shrinking based on the eight boundary-following
directions shown in Figure 3 and defined by the
following equation:

shrink_dir = (follow_dir + 2) mod 8 (3)

where if shrink_dir is zero, then shrink_dir is
reassigned to eight.
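
Equation (3), together with the reassignment rule, can be written as a small helper. The sketch assumes directions are numbered 1 through 8 as in Figure 3; the function name is hypothetical.

```python
def shrink_direction(follow_dir):
    """Map a boundary-following direction (1..8) to the shrink direction.

    Implements shrink_dir = (follow_dir + 2) mod 8, with a result of
    zero reassigned to eight so that directions stay in the range 1..8.
    """
    shrink_dir = (follow_dir + 2) % 8
    return 8 if shrink_dir == 0 else shrink_dir
```

A following direction of 6 therefore maps to 8 rather than 0, and directions 7 and 8 wrap around to 1 and 2.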

Figure  4  shows  the  result  of  shrinking  a  simple  cross 
object  after  only  one  iteration.  The  starting  point  is 
the  top-left  position  of  the  object,  and  the  arrows 
represent the boundary-following direction, which is
clockwise. 'x' and '.' denote the boundary of the
original object and the boundary of the shrunken object,
respectively. Note that the circled points are inserted
to make the shrunken object boundary connected. The
connectivity of the shrunken-object boundary is
essential for the next iterations. If a boundary point
is an element of a parallel line or a single line
(overlapped) then the point is just copied without
shrinking. The parallel line and the single line are
depicted in Figure 5(a) and 5(b), respectively.

The  shrinking  step  produces  the  skeletons  of  the 
objects,  which  are  at  most  two-pixels  wide.  To  make 
single-pixel  wide  skeletons,  we  use  the  Zhang  and  Suen 
algorithm  [2]  which  preserves  the  connectivity  of  the 
skeletons.  Note  that  we  can  remove  two-pixel-wide 
points  by  following  the  skeleton  data  points  instead  of 
scanning  all  over  the  working  plane.  Also  note  that  we 
need  only  one  iteration,  that  is,  two  subiterations. 


Figure  3.  Eight  boundary-following  directions. 


4.  Experimental  Results 

Our  parallel-thinning  algorithm  was  implemented  on  a 
hypercube-topology  multiprocessor  computer,  the  Intel 
iPSC/2.  Figure  6(a)  shows  a  test  image  which  contains 
sixteen  'H's.  The  size  of  the  image  is  512x512. 
According  to  the  input  image  distribution  method 
discussed  in  section  2.1,  the  input  image  was  divided  by 
the number of nodes and each sub-image was distributed
to each node. Then the boundary detection for each
sub-image was performed in parallel. Figure 6(b) shows the
result of the boundary detection from the original
image. Through the parallel linking procedure, the
sixteen objects were extracted and redistributed to the
nodes as evenly as possible. Finally, the objects were
thinned by boundary-following and shrinking. Figure
6(c) is the final result. The skeletons are single-pixel
wide.

For  the  comparison,  we  also  implemented  Zhang  and 
Suen's  algorithm  on  the  iPSC/2.  As  we  discussed  in  the 
introduction,  we  needed  to  swap  data  between  nodes  at 
every  iteration.  The  processing  times  of  Zhang  and 
Suen's  algorithm  as  well  as  our  algorithm  according  to 
different  numbers  of  nodes  are  shown  in  Table  1.  The 
data described above was used for both cases. We can
see  that  our  algorithm  is  much  faster  than  Zhang  and 
Suen's.  But  our  algorithm  has  some  degradation  when 
using  32  nodes  because  there  are  only  16  objects  in  the 
image  and  more  communication  time  is  needed  for  object 
extraction  as  more  nodes  are  used. 


5.  Conclusions 

In  most  conventional  thinning  algorithms,  the 
complexity  of  the  whole  procedure  is  O(n^).  Moreover, 
the  algorithms  are  undesirable  for  distributed  memory 
computers.  To  solve  these  problems,  we  have  presented 
a  new  parallel-thinning  algorithm  which  is  based  on 
boundary  following  and  shrinking.  Also  we  have 
implemented this algorithm on the Intel iPSC/2.
Theoretically,  the  complexity  of  our  algorithm  is  O(n^). 
According  to  the  experimental  results,  our  algorithm  is 
much  faster  than  Zhang  and  Suen's.  However,  our 
algorithm  depends  on  the  number  of  objects  in  the  input 
image.  That  is,  the  more  objects  in  the  scene,  the  more 
efficient  the  algorithm.  Another  problem  is  that,  so 


Figure 4. The result of shrinking a simple cross
object after one iteration.


Figure 5. (a) Parallel line
(b) Single line (overlapped)

Figure 6. (a) An original input image (512x512)
(b) A boundary-detected image
(c) A thinned image
Abstract

The problem of tracking multiple targets
in the presence of displacement noise and clutter
is formulated as a nonconvex optimization
problem. The form of the suggested cost function
is shown to be suitable for the Graduated
Non-Convexity algorithm, which can be viewed
as deterministic annealing. The method is first
derived for the two-dimensional (spatial/temporal)
case, and then generalized to the multidimensional
case. The complexity grows linearly
with the number of targets. Computer simulations
show the performance with crossing trajectories.

Introduction 

The problem of tracking is that of estimating
the trajectories of moving (point) objects
given a set of noisy measurements in time.
Many approaches have been suggested for tracking,
some of which will be briefly mentioned here
so as to clarify the relationship between them
and our new method.

Two types of noise are assumed present,
namely, displacement noise and clutter. The
displacement noise corresponds to errors in the
location of returns with respect to the actual
locations of targets. Clutter consists of noisy
points which do not relate to an existing target.
If only displacement noise were present, then the
problem would reduce to that of curve fitting to
minimize some appropriate measure. In particular,
if the target dynamics could be modelled by
state space equations driven by Gaussian white
noise, then the Kalman filter recursive solution
could be used to minimize the mean squared error
criterion.

The presence of clutter, however, adds a
data association aspect to the problem, i.e.
which of the observed returns corresponds to
the target. The Probabilistic Data Association
method [2] overcomes this difficulty by considering
only the most likely associations and assigning
an association probability to each hypothesis.
The method outputs as state estimate the
corresponding average of the conditional state
estimates. Another difficulty arises when dealing
with multiple targets. Unlike clutter, the
presence of another target produces points with
a structured distribution. Thus in the case of
crossing targets, these points may be assigned
high association probabilities and mislead the
estimator. This gave rise to the Joint Probabilistic
Data Association method [3], which assigns
joint association probabilities to sets of
hypotheses. The complexity of this method
clearly grows combinatorially. A neural network
method for approximating the joint association
probabilities has recently been proposed [4].

An interesting approach to tracking is by
using Dynamic Programming [5]. Here the space
is discretized, and a full search through all possible
states is efficiently performed by exploiting
special properties of the problem. The advantage
of the method is that for a single target, it
will always find the optimal trajectory (within
the resolution of the grid). On the other hand,
the ability to resolve crossing targets is determined
by the resolution since two trajectories
passing through the same state will be merged
by the search procedure. Refining the resolution
clearly affects the complexity.

Hough Transform methods have also been
suggested for tracking [6]. The Hough Transform
detects trajectories belonging to a specified
family of parametrized curves, by a voting
procedure. It is relatively insensitive to clutter,
but quite sensitive to displacement noise. Much
work has been devoted to reducing the complexity
of the multi-dimensional Hough Transform.
It can naturally be used as a track initiator for
another tracking method, by detecting possible
trajectories within small windows in the raw
data.
In this paper a new approach is proposed.
First, the tracking problem is reformulated as a
non-convex optimization problem, i.e., the minimization
of an appropriate energy function. The
form of this energy function is then shown to be
suitable for an algorithm based on the principle
of Graduated Non-Convexity (GNC) which
was proposed for visual reconstruction [1]. We
propose a deterministic algorithm which enables
avoiding local minima. In fact the energy function
is replaced by a sequence of energy functions
which converges to the original energy
function. The sequence starts with a convex
energy function and gradually introduces non-convexity
as it approaches the final energy function.

The method is first developed for the two-dimensional
(space/time) case, and then generalized
to deal with the n-dimensional case. Simulation
results are shown to demonstrate the
performance. Finally, issues of possible parallel
implementation, notably in terms of cellular
automata, are discussed.


The  two  dimensional  (time/space) 
derivation 

As stated in the introduction, the problem
is made hard by its data association aspect,
i.e., which point is associated with which target.
In fact, if we knew the correct data association
we could easily compute the optimal
trajectory since the energy function would be
convex. Moreover, the analytic solution could
be given in terms of Green functions. The approach
in this study will be to implicitly look
for the set of points to associate with a target
so as to minimize the energy. From such a viewpoint,
if one considers all possible trajectories
for a target, one should compute its energy after
assigning to it the nearest returns.

The proposed energy function for two-dimensional
(spatial-temporal) data is


E-  ^mm{(ui  - 


(1) 


where $u_i$ is the trajectory location at time i, $d_i^{(j)}$
is the j'th data point at time i, and $\hat{u}_i$ is some
prediction of the trajectory given past data or
other external information such as other sensors
etc., which may be nonuniformly weighted in
time ($\sigma_i$). The first term of the energy function
measures the trajectory's distance from the
observed data. The second term penalizes non-smooth
trajectories. The third term takes into
account predictions and allows adding external
information.

This energy function has many local minima
because of the first term, which is not convex;
indeed, the first term contains the data
association problem. Reconsider the first term,

$E_1 = \sum_i g_i(u_i)$, (2)

where

$g_i(x) = \min_j \, (x - d_i^{(j)})^2$. (3)

We wish to find a convex approximation $E^*$ to
the energy function, and we shall do it by replacing
the functions $g_i$ by some $g_i^*$. The condition
for convexity is that the Hessian be positive definite.
The Hessian of $E^*$ is given by


$H_{ij}(u) = \frac{\partial^2 E^*}{\partial u_i \, \partial u_j}$ (4)


where $\delta_{ij}$ is the Kronecker delta and Q is the
matrix given by

$Q_{ij} = \begin{cases} 2, & \text{if } i = j; \\ -1, & \text{if } |i-j| = 1; \\ 0, & \text{otherwise.} \end{cases}$ (5)

Since the matrix Q is positive semidefinite,
requiring the diagonal matrix representing
the first and third terms in (4) to be positive
definite ensures that so is H, and therefore that $E^*$
is convex. Hence we require


dx^  ~  arf 


-Ci. 


(6) 


The functions $g_i$ are piecewise parabolic as illustrated
in Figure 1. The best approximating
functions (from below) $g_i^*$ which satisfy (6)
are obtained by fitting inverted parabolas to the
boundaries as shown in Figure 1. These functions
are differentiable and their derivative is
continuous. Between two detected points, 2d
apart, we get (assuming the origin is at the midpoint)

$g_i^*(x) = \begin{cases} \frac{c\,d^2}{1+c} - c\,x^2, & |x| < \frac{d}{1+c}; \\ (d - |x|)^2, & \text{otherwise.} \end{cases}$ (7)

Note that for $c = c_i$ we get the convex approximation
we needed, while on the other hand for
$c \to \infty$ we get $g_i^* \to g_i$ and therefore $E^* \to E$.
One may therefore choose c as a natural parameter
for gradually introducing non-convexity
into the energy function. This is not the only
possibility; another choice of parameter, which
is closely related to multi-scale methods, is currently
under investigation.
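
The one-dimensional approximation between two detections can be checked numerically with the sketch below. It assumes two detections at -d and +d, so the exact data term is the squared distance to the nearer one, and it uses the inverted parabola with peak value c d^2 / (1 + c) inside the sleeve |x| < d / (1 + c), which is what (11) and (12) give when R = d / (1 + c). The function names are hypothetical.

```python
def g(x, d):
    """Exact data term for two detections at -d and +d."""
    return (d - abs(x)) ** 2


def g_star(x, d, c):
    """Smoothed data term: inverted parabola inside the sleeve, exact outside.

    The sleeve half-width d / (1 + c) shrinks as c grows, so g_star
    approaches g as c tends to infinity.
    """
    half_width = d / (1.0 + c)
    if abs(x) < half_width:
        return c * d * d / (1.0 + c) - c * x * x
    return (d - abs(x)) ** 2
```

At the sleeve edge the two branches agree in value and slope, and inside the sleeve the difference g - g_star equals (1 + c)(|x| - d/(1 + c))^2, which is never negative, so the approximation stays below the exact term.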

The algorithm will therefore be along the following
lines. Initialize $c = c_i$ so that the energy
function is convex, and optimize using your
favorite method (e.g. gradient descent). At
each iteration increase c to introduce some non-convexity
and re-optimize. An important issue
is that of where to stop the iterations. Recall
that

$g_i^*(x) \le g_i(x), \quad \forall x$ (8)

which implies that if the configuration $u^*$ globally
minimizes $E^*$, then

$E^*(u^*) \le E(u), \quad \forall u.$ (9)

Hence, $u^*$ is the global minimum of E if and
only if

$E(u^*) = E^*(u^*).$ (10)

In certain cases it turns out that the convex approximation
is already a good enough approximation
of the energy function (this depends on
the choice of parameters) so that (10) holds and
the global minimum is found. Moreover, if by
choosing a careful schedule for updating c we
can ensure that we are always at the global minimum
of $E^*$, then whenever we reach a configuration
which satisfies (10), we have found the
global minimum of E. Note that each of the intervals
over which $g_i \ne g_i^*$ is made smaller as c
is increased.
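
The overall procedure can be sketched for a single one-dimensional track as follows. This is a hedged illustration, not the authors' code. It assumes the energy is the sum of the smoothed data terms, a first-difference smoothness term weighted by `lam`, and a prediction term weighted by a single constant `sigma`. It also uses plain gradient descent, a doubling schedule for c, and a fixed final value `c_max` in place of the stopping test (10). With this assumed energy, starting at c = sigma keeps the first stage convex. All names and default values are hypothetical.

```python
def data_term(x, points, c):
    """Value and derivative of the smoothed data term for sorted detections."""
    if x <= points[0]:
        return (x - points[0]) ** 2, 2.0 * (x - points[0])
    if x >= points[-1]:
        return (x - points[-1]) ** 2, 2.0 * (x - points[-1])
    for a, b in zip(points, points[1:]):
        if a <= x <= b:
            break  # x lies between the adjacent detections a and b
    mid, d = 0.5 * (a + b), 0.5 * (b - a)
    t = x - mid
    if abs(t) < d / (1.0 + c):
        # inside the sleeve: inverted parabola around the midpoint
        return c * d * d / (1.0 + c) - c * t * t, -2.0 * c * t
    nearest = a if t < 0 else b
    return (x - nearest) ** 2, 2.0 * (x - nearest)


def energy_grad(u, data, pred, c, lam, sigma):
    """Gradient of the assumed energy with respect to each u[i]."""
    grad = []
    for i, x in enumerate(u):
        _, dg = data_term(x, data[i], c)
        gi = dg + 2.0 * sigma * (x - pred[i])
        if i > 0:
            gi += 2.0 * lam * (x - u[i - 1])
        if i < len(u) - 1:
            gi += 2.0 * lam * (x - u[i + 1])
        grad.append(gi)
    return grad


def gnc_track(data, pred, lam=1.0, sigma=0.05, c_max=64.0, iters=400):
    """Minimize a sequence of energies with gradually increasing c."""
    data = [sorted(p) for p in data]
    u = list(pred)
    c = sigma  # convex starting point under the assumed energy
    while True:
        # step from an upper bound on the gradient's Lipschitz constant
        step = 1.0 / (2.0 + 2.0 * c + 2.0 * sigma + 8.0 * lam)
        for _ in range(iters):
            grad = energy_grad(u, data, pred, c, lam, sigma)
            u = [x - step * gx for x, gx in zip(u, grad)]
        if c >= c_max:
            return u
        c = min(2.0 * c, c_max)
```

A call such as `gnc_track([[1.0], [1.0, 6.0], [1.0], [-4.0, 1.0], [1.0]], pred=[0.8] * 5)` should settle close to the detections at 1.0 and ignore the isolated clutter points. A fuller version would evaluate E and E* after each stage and stop as soon as (10) holds.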

Generalization to the n-dimensional
space

The generalization will be given for the n-dimensional
spatial case and illustrated for the
two dimensional spatial case. The main issue
here is to produce a convex approximation to
the energy function. Once we have that, we shall
immediately see how to introduce non-convexity.

Again let us consider the first term of
the energy function. It is a set of paraboloids
centered at the data points. Over a two-dimensional
space, the energy function looks like
an irregular egg tray. The boundaries within
which each paraboloid is defined are given by
the appropriate Voronoi diagram. This is a set
of hyperplanes which encloses with each data
point all the points in space which are nearest
to this data point. In order to obtain the convex
approximation we smooth the function over
these boundaries to satisfy the second derivative
requirements. Similarly to the one-dimensional
case (7), the function is modified within a sleeve
around the boundary hyperplanes.

The form of the approximating function at
a given point will depend on the number of data
points associated with it. For the case of two dimensional
space, a point is associated with two
data points if it is in a sleeve, and with three
data points if it is in the intersection of two
sleeves. In general each point may be associated
with up to n+1 data points (excluding pathologies).
Now suppose that we are in a zone that is
associated with k+1 data points. These points
are all in a k-dimensional subspace. Moreover,
assuming they are "generally positioned", i.e.,
no (k-1)-dimensional subspace contains all of
them, then they are on some k-dimensional hypersphere
(which will be simply referred to as
sphere).

The approximating function is defined as
an inverted paraboloid over the k-dimensional
space centered at the center of the bounding
sphere, and an upright paraboloid in the remaining
orthogonal directions.

$g_i^*(x_1, \ldots, x_n) = K - c \sum_{j=1}^{k} x_j^2 + \sum_{j=k+1}^{n} x_j^2$ (11)

where K is a constant to be determined, and k+1
is the number of data points associated with
$(x_1, \ldots, x_n)$. These data points are in fact
the vertices of a hyperpolyhedron in the k-dimensional
space. The intersection of the corresponding
sleeves is a smaller polyhedron congruent
to it whose bounding sphere has the same
center (see Fig. 2 for the two-dimensional case).
Let R be the radius of this hypersphere, then

$K = cR^2 + g_i(v)$ (12)

where v will stand for any of the vertices of
the sleeve intersection. Note that $g_i$ has the
same value for all these vertices (equally distant
from data points). Note also that for the one-dimensional
spatial case we obtain (7) from (11)
and (12) by substituting R = d/(1+c), which is
indeed the radius of the one-dimensional bounding
sphere, i.e. half the distance.

We shall omit the details here but it is not
difficult to show that the approximating functions
$g_i^*$ are continuous, differentiable and their
derivative is continuous everywhere. Such an
approximating function is shown in Fig. 3.

We have generalized our convex approximation
to multi-dimensional spaces, and the resulting
energy can still be naturally parametrized
by c to introduce non-convexity.

On  parallel  implementation  of  the 
algorithm 

In the previous sections we have constructed
the sequence of energy functions which
starts with a convex approximation and converges
to the original energy function. However,
the actual computation in the algorithm
does not involve evaluating these functions everywhere.
All that is required at each iteration
is to evaluate the derivative with respect to each
variable at the current point. As the $g_i^*$ functions
are defined by cases, the main problem is
to establish the case, i.e., with how many and
which data points it is associated. By geometrical
considerations, it can be shown that given
the current point and the nearest data point, all
that is required is to search a certain window for
additional data points. The window is defined
as the difference of two hyperballs B - b, where b
is a ball centered at the current point and whose
radius is the distance to the nearest data point

r  =  |x-d(>)|.  (13) 

The larger ball B is the interior of a sphere passing
through the data point, whose center is on
the line connecting the two points, and whose
radius  is 


This is illustrated in Fig. 4. The data points
found in the crescent B - b will determine the
form of $g_i^*$.

Let us reconsider the energy function given
in (1). As there is no interaction between tracks,
the complexity grows linearly with the number
of tracks. In fact, all trajectories can be computed
in parallel. We shall next discuss possible
parallelization of the computation of a single
trajectory. The second observation to make
is that trajectories are temporally but not spatially
discretized.

The fact that the trajectories are not discretized
in space avoids a priori limitations
on resolving crossing targets. The discretization
of space into a large number of mutually
exclusive states is typical of neural-network
formulations and dynamic programming methods.

On the other hand, the discretization in
time, which is assumed to be a property of the
input, enables a parallel implementation. This
can be done by a network of processors, each
in charge of a given time slice. The only
inter-processor communication is within small
neighborhoods, through the smoothness term of
the energy function, which contains a temporal
derivative. It is therefore natural to visualize
such a system in terms of cellular automata.
For a given time window, each cell processes the
data of one time slice while incorporating into the
processing the output of its defined neighbors.

In order to eliminate the need to transfer
input data between processors as the time
window slides, a cyclic index rotation is used.
The processors are connected in a circle, and the
connection is severed between the last and the
first time slices in the window. As the window
slides by one time unit, all indices are rotated so
that each processor still deals with the same input
data, but advances within the window. The
processor which was last now becomes first and
receives fresh input. Note that the disconnected
branch is also rotated to be between the current
last and first time slice in the window.
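
The cyclic index rotation can be illustrated with a
minimal sequential sketch. It is not the authors' code:
the class name, the method names and the choice of
calling the newest slice "first" (window index 0) are
assumptions made for illustration only.

```python
class SlidingWindowRing:
    """Ring of processors, one time slice each (hypothetical names)."""

    def __init__(self, initial_slices):
        # data[p] is the input held by processor p; it never moves.
        self.data = list(initial_slices)
        # Processor currently holding the first slice of the window.
        self.first = 0

    def window_index(self, p):
        """Position of processor p inside the window (0 = first)."""
        return (p - self.first) % len(self.data)

    def severed_link(self):
        """Ring link that is cut: (last processor, first processor)."""
        last = (self.first - 1) % len(self.data)
        return (last, self.first)

    def slide(self, fresh_slice):
        """Slide the window by one time unit by rotating indices only."""
        # The processor that was last becomes first and gets fresh input.
        self.first = (self.first - 1) % len(self.data)
        self.data[self.first] = fresh_slice
```

With four processors initially holding slices t3, t2,
t1, t0 (in window order), one call to slide("t4")
overwrites only the slice on the formerly last
processor; every other processor keeps its data and its
window index grows by one.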

Simulation 

A  simulated  example  is  shown  in  Fig.  5. 
There  is  one  spatial  dimension  and  one  temporal 
dimension.  Five  crossing  targets  are  detected  in 
the  presence  of  clutter  and  displacement  noise. 
The  targets  were  generated  by  specifying  initial 
positions  and  velocities,  and  applying  small  ac¬ 
celeration  noise  to  them  at  each  time  unit. 

Summary 

A nonconvex cost optimization approach is
suggested for multitarget tracking in the presence
of displacement noise and clutter. The
method is based on deriving a convex approximation
to the energy function and gradually introducing
nonconvexity. By this procedure we
start with the global minimum of the approximated
energy function, and perform “tracking”
of the global minimum while varying the nonconvexity
parameter. In this respect the method
can be viewed as deterministic annealing. The
convex approximation was derived for the two-dimensional
(spatial/temporal) case and then
generalized for multi-dimensional cases. The
computations can be performed in parallel per
track and per time slice. A simulated example
is presented to demonstrate the performance of
the method.
1  Overview

Simulation89 is an emulation of various SDI tasks
(tracking, engagement management and ‘look ahead’)
developed for the U.S. Air Force. The simulation
presently deals with the boost, post-boost and early
midcourse phases of a ‘mass raid’ scenario, and is designed
to process scenarios with a few thousand targets.
The simulation is run on the Mark-III hypercube,
with individual tasks performed on subcubes
of the full hypercube. In general, the computations
within individual subcubes are done in a synchronous
manner (i.e., CrOS), while communications between
tasks/subcubes are done asynchronously.

The nominal task for the tracking module is to provide
state information on individual targets, given 2D
line-of-sight data from various sensors at regular time
intervals. This task is complicated by a few
additional requirements:

1.  In  the  initial  boost  phase,  the  trajectories  of  in¬ 
dividual  targets  are  not  fully  known. 

2.  The overall system must scale in such a way that
increases in the size of the underlying scenario
are accommodated by (proportional) increases in
the size of the tracking subcube.

3.  The  tracker  must  meet  ‘real  time’  requirements. 

The first requirement in fact dictates the gross overall
structure of the tracking package, as illustrated
in Fig.(1). A single tracking system is formed from
two elementary 2D tracking subsystems. Each 2D
tracking subsystem processes individual data from its
own associated sensor, forming lists of plausible mono
tracks through these data sets. These 2D tracks are
then shared between the two 2D subsystems, and a
single set of 3D tracks is formed.

Figure 1: Schematic Tracker Organization

The tracking models used for the 2D and 3D subsystems
are quite different. According to the first requirement
listed above, it must be assumed that the
data from a single sensor are insufficient to resolve
all Track↔Hit ambiguities. As a consequence, the 2D
systems use a Multiple Hypothesis formalism in which
many candidate tracks through a single sensor datum
are allowed. Such a model is subject to exponential
explosions of the overall track file. This fundamental
difficulty is resolved by a number of rules for pruning
the size of the overall track file. In particular:

1.  Two  tracks  ending  on  a  given  datum  are  said  to 
be  equivalent  if  they  share  the  same  2D  data  over 
the  last  four  scans. 

2.  The  number  of  inequivalent  tracks  per  datum  is 
limited  by  a  cutoff  parameter. 

3.  The  total  number  of  2D  tracks  is  also  limited  by 
a  global  cutoff. 

If two tracks in the system are found to be equivalent
according to point 1, one of the tracks is simply
deleted. As is discussed below, the task of identifying
equivalent tracks in the distributed 2D track file dictates
the manner in which the 2D tracking problem is
decomposed for concurrent execution.
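
The three pruning rules can be sketched sequentially
as follows. This is an illustration under stated
assumptions, not the Sim89 implementation: a track is
represented as a (score, hits) pair, where hits is a
tuple of datum identifiers with the most recent scan
last, and a lower score is assumed to mean a better
track. The text does not say which of two equivalent
tracks is deleted; keeping the better-scoring one is
an assumption.

```python
from collections import defaultdict


def prune_tracks(tracks, per_datum_cutoff, global_cutoff, depth=4):
    """Apply the three pruning rules to a list of (score, hits) tracks."""
    # Rule 1: tracks sharing the same data over the last `depth` scans
    # are equivalent; keep one representative (assumed: lowest score).
    best = {}
    for score, hits in tracks:
        key = hits[-depth:]
        if key not in best or score < best[key][0]:
            best[key] = (score, hits)

    # Rule 2: limit the number of inequivalent tracks per final datum.
    by_datum = defaultdict(list)
    for score, hits in best.values():
        by_datum[hits[-1]].append((score, hits))
    survivors = []
    for group in by_datum.values():
        group.sort()
        survivors.extend(group[:per_datum_cutoff])

    # Rule 3: limit the total number of 2D tracks.
    survivors.sort()
    return survivors[:global_cutoff]
```

Because equivalence and the per-datum cutoff both
depend only on tracks that end on the same datum, the
first two rules need no communication once all such
tracks reside on one node.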

Unlike the 2D tracking system, the 3D tracker
in Fig.(1) maintains (at most) one track per sensor
data point, representing the best global interpretation
of tracks through the data (this single ‘best
guess’ answer is the output of the tracker expected
by the other elements of Sim89). In place of the
Multiple-Hypothesis model used for 2D tracking, the
3D tracker is based on Optimal Associations. These
optimal associations in fact come in two distinct
forms:

1.  For track extensions, the predicted data positions
of individual 3D tracks are associated with actual
data from the two sensing subsystems of Fig.(1).

2.  For 3D track initiations, 2D tracks from the two
subsystems of Fig.(1) are associated according to
values of projections onto an association reference
axis (so-called ‘Hinge Angle’ associations).

The adoption of an Optimal Association formalism
essentially trivializes the concurrent decomposition of
the 3D tracker: the 3D tracks are distributed among
the nodes of the subcube in such a way that the number
of tracks per node is constant. The challenge of
concurrent 3D tracking comes entirely in performing
the two types of optimal associations.

A general concurrent algorithm for optimal associations
is described in Ref.[1]. However, the resource
requirements for the general optimal association problem
(ΔT ∝ for N×N association problems) are
such that a straightforward use of the general association
formalism is completely inappropriate. Instead,
the concurrent association algorithm proceeds as follows:

1.  Each node computes a list of association keys
(i.e., projections onto an appropriate reference
axis) for all items in its local track list.

2.  The  distributed  lists  of  association  keys  are  glob¬ 
ally  sorted. 

3.  The  sorted  lists  are  divided  into  a  number  of  sub¬ 
blocks,  determined  by  appropriately  large  gaps  in 
the  lists  of  keys. 

4.  The  sub-blocks  are  assigned  to  individual  nodes 
and  the  assignment  problems  for  sub-blocks  are 
solved  using  a  modified  ‘sparse’  formalism  of  the 
general  assignment  problem. 

This  procedure  is  efficient  as  long  as  the  number  of 
separate  sub-blocks  found  in  the  third  step  is  large 
compared  to  the  number  of  nodes  in  the  tracking  sub¬ 
cubes  (which  is,  empirically,  almost  always  the  case 
for  the  Sim89  problem). 
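
Steps 2 to 4 can be sketched on a single processor as
shown below. The function names, the single merged key
list, the `min_gap` threshold and the largest-first
dealing policy are all assumptions for illustration;
the text specifies neither how the gap size is chosen
nor how sub-blocks are mapped to nodes, and the sparse
assignment solver itself is not shown.

```python
def split_into_subblocks(keys, min_gap):
    """Sort keys and cut wherever neighbours differ by more than min_gap.

    Returns a list of sub-blocks, each a list of indices into `keys`.
    """
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    blocks, current = [], []
    for i in order:
        if current and keys[i] - keys[current[-1]] > min_gap:
            blocks.append(current)
            current = []
        current.append(i)
    if current:
        blocks.append(current)
    return blocks


def deal_blocks(blocks, n_nodes):
    """Give each sub-block to the least loaded node, largest blocks first.

    Load is measured by item count, a simplification of the true
    assignment cost (assumed policy).
    """
    loads = [0] * n_nodes
    owner = {}
    for b, block in sorted(enumerate(blocks), key=lambda kv: -len(kv[1])):
        node = loads.index(min(loads))
        owner[b] = node
        loads[node] += len(block)
    return owner, loads
```

Items in different sub-blocks are too far apart along
the reference axis to be associated, so each node can
solve its sub-blocks independently. When there are many
more sub-blocks than nodes, the per-node loads come out
nearly equal, which is the efficiency condition stated
above.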

In addition to the central tasks of Track↔Hit and
Track↔Track associations, the 3D tracker also evaluates
trajectory fits for all 3D tracks in the system. Unlike
in the predecessors to Sim89, these trajectory fits are
not essential elements of tracking per se. All tracking
is done using kinematic system models. The trajectory
parameterizations are added to the tracking task
for purposes of communications: the 3D track file
structures according to the kinematic model are huge
(more than 100 floating point numbers per track),
and the model-dependent parameterizations greatly
reduce the sizes of track file messages passed between
subcubes of the full Sim89 simulation. The concurrent
estimation of track parameters is again trivial,
with each node independently performing this task
for its own subset of the global track file.

Figure 2: Data Set and Track File Sizes Versus Scan

2  Some  Sample  Results 

This  section  briefly  examines  some  typical  results  of 
the  Sim89  tracker  for  a  standard  input  set.  The  threat 
scenario  involves  200  primary  targets,  each  of  which 
ultimately  spawns  10  daughter  objects  (RV’s).  The 
targets  are  launched  from  six  separated  launch  sites 
over a two minute time window. The primary threat
is preceded by a simultaneous launch of sixty secondary
targets (ASAT’s).

Sizes of the data sets and 2D and 3D track files are
plotted versus scan number in Fig.(2). The peaks near
scan 40 are due to interception of the ASAT’s, while
the prolonged increase in object counts after scan 80
is due to gradual deployment of RV’s from the surviving
primary targets. As expected, the number of
2D tracks greatly exceeds the actual number of targets.
The ‘kinks’ in the 2D track counts for large scan
number are the result of the automatic reductions in
tracks/datum cutoffs mentioned in Section 1.

The number of 3D tracks is very close to the actual
number of targets. The histogram in Fig.(3) shows
the percentage of targets in track,

$$P[\text{In Track}] = N[\text{3D Tracks}] / N[\text{Data}]$$

versus scan number. Once the primary targets are
into second stage (about scan 50), the percentage in
track is excellent. Also shown in Fig.(3) are the fractions
of pure tracks

$$P_j = N[j\text{-Scans Correct}] / N[\text{3D Tracks}]$$

where the numerator is the number of 3D tracks which
(correctly) incorporate data from a single underlying
target through the past j scans.

The mild degradations in both percentage in track
and j-Scan correct tracks between scans 150 and 200
are due to the successes of the engagement management
component of Sim89 in intercepting the targets.
The disappearance of expected targets causes some
‘confusion’ for Track↔Hit associations on subsequent
tracking scans.

The CPU times for various components of 3D tracking
are plotted versus scan number in Fig.(4). Most
of the CPU resources are spent in the evaluation of
Track↔Hit associations. With the exception of the
‘confused’ scans with disappearing tracks, the CPU
requirements for tracking generally scale as
ΔT ∝ N log N for N active targets.

Figure 4: CPU Step Times For 3D Tracking

3  Concurrent  Aspects 

The task of multi-target tracking is well-suited for
concurrent execution, with most of the ‘tracking’ per
se done by way of CPU-intensive operations involving
individual track↔data pairs (the filter update of a
single 3D track involves more than one thousand floating
point operations; an ideal use of the WEITEK
coprocessor). In the entire tracking program (more
than 35000 lines of code), there are really only three
general concurrent operations/aspects,

1.  Global  collection  of  data  across  the  subcube. 

2.  Distributed  sorting. 

3.  Track  file  redistributions. 

with each of these tasks occurring in a variety of guises.
The sorting task is done using the basic algorithm of
Ref.[2], with a trivial but important modification to
allow empty local sublists (empty track files on some
nodes occur during the first few scans of the tracking
task).

The  global  data  collections  are  all  done  using  a  sim¬ 
ple  loop  on  communication  channels: 

•  Set Global Value To Local Value
•  Loop On Communication Channels

-  Exchange Values Across Channel
-  Update Global Value Using Input

•  End Of Channel Loop

This simple paradigm is used throughout the code
for a variety of purposes, such as assessment of global
status (the ‘Update’ task is a logical AND of individual
status flags), determining global file sizes (‘Update’ is
simple addition) or assessing global Track↔Data assignments
(‘Update’ is a slightly more complicated
merging of track assignment arrays generated on individual
nodes).
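
The channel loop can be simulated on one processor as
in the sketch below. It is illustrative only: the
function name is hypothetical, node numbers are assumed
to be hypercube addresses so that the partner across
channel c is found by flipping bit c, and the ‘Update’
operation is assumed to be commutative and associative
so that every node ends with the same value.

```python
def channel_loop_combine(local_values, update):
    """Simulate the channel loop on a hypercube of len(local_values) nodes.

    local_values[node] is the value initially held by that node;
    update(mine, theirs) is the 'Update' operation. Returns the global
    value as seen by every node.
    """
    n = len(local_values)
    dim = n.bit_length() - 1
    if n < 1 or (1 << dim) != n:
        raise ValueError("number of nodes must be a power of two")

    # Set Global Value To Local Value
    global_values = list(local_values)
    # Loop On Communication Channels
    for channel in range(dim):
        # Exchange Values Across Channel
        received = [global_values[node ^ (1 << channel)] for node in range(n)]
        # Update Global Value Using Input
        global_values = [
            update(mine, theirs)
            for mine, theirs in zip(global_values, received)
        ]
    return global_values
```

Passing addition as `update` yields the global file
size on every node, and passing a logical AND yields
the global status flag. The loop runs once per channel,
i.e., log2 of the number of nodes.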

For  the  3D  tracking  task,  concurrent  efficiency  re¬ 
quires  only  that  the  number  of  tracks  per  node  be 
approximately  constant.  Accordingly,  track  file  re¬ 
distribution  for  3D  tracking  is  done  using  a  simple 
variant  of  the  channel  loop  model: 

•  Loop On Communication Channels

-  Exchange Track File Sizes
-  Set $\delta = (N_{\text{here}} - N_{\text{there}})/2$
-  If $\delta > 0$, Send $\delta$ Items Over Channel.
-  If $\delta < 0$, Receive $|\delta|$ Items.

•  End Of Loop On Channels.

After  the  exchanges  across  a  given  channel,  the 
number  of  items  on  each  half  of  the  subcube  with 
respect  to  that  channel  is  (approximately)  the  same, 
and  subsequent  loops  on  other  channels  do  not  mod¬ 
ify  this  equality.  At  the  end  of  the  channel  loop,  the 
tracks  are  equally  divided  across  all  channels  -  mean¬ 
ing  that  the  number  of  tracks  per  node  must  be  ap¬ 
proximately  the  same. 
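
A sequential simulation of this balancing loop is
sketched below. The function name and the rounding of
the half-difference toward zero are assumptions;
hypercube addressing (the partner across channel c
differs in bit c) is also assumed.

```python
def balance_track_files(files):
    """Balance per-node track lists with one exchange per channel.

    files[node] is the list of tracks on that node. Returns new lists;
    the input is not modified.
    """
    n = len(files)
    dim = n.bit_length() - 1
    if n < 1 or (1 << dim) != n:
        raise ValueError("number of nodes must be a power of two")

    files = [list(f) for f in files]
    for channel in range(dim):
        for node in range(n):
            partner = node ^ (1 << channel)
            if node > partner:
                continue  # each pair is handled once, from its lower node
            # Exchange track file sizes and compute the half-difference.
            diff = len(files[node]) - len(files[partner])
            sender, receiver = (node, partner) if diff > 0 else (partner, node)
            for _ in range(abs(diff) // 2):
                files[receiver].append(files[sender].pop())
    return files
```

Starting from all eight tracks on node 0 of a four-node
subcube, the first channel leaves sizes 4, 4, 0, 0 and
the second leaves 2, 2, 2, 2. With odd counts the
rounding leaves differences of a few tracks, which is
why the balance is only approximate.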

The only aspect of the full tracking program which
involves concurrent ‘subtleties’ is the redistribution
of the global 2D track files. Wasteful cube-wide
searches for equivalent tracks (same sensor data over
the last four scans) can be avoided if the assignment of
tracks to nodes is done according to a single essential
requirement.

At  the  start  of  each  2D  tracking  scan,  all 
tracks  ending  on  a  given  sensor  datum  are 
to  be  assigned  to  a  single  node. 

If this condition is satisfied, then searches for equivalent
tracks need only be done locally. The requirement
is enforced as follows:

1.  Assign  each  datum  of  the  current  data  set  to  a 
specific  node. 

2.  Transfer  all  tracks  in  the  system  to  that  node 
which  ‘owns’  the  data  point  for  the  last  scan  in¬ 
cluded  in  the  track. 


This  redistribution  is  in  fact  done  as  the  last  step 
in  2D  processing  at  each  scan,  so  that  the  next  scan 
begins  with  the  basic  track  distribution  requirement 
satisfied. 

Once data points (hence tracks) have been assigned
to individual nodes, the actual redistribution of the
tracks is a straightforward application of the basic
Crystal_Router formalism of Ref.[3]. The calculation
of destinations for individual data is done using the
following simple set of rules:

1.  Each datum is assigned a Weight, taken to be the
total number of tracks in the system which end at
that datum.

2.  Data  are  assigned  to  nodes  such  that  the  total 
Weight  per  node  is  approximately  constant. 

3.  If,  prior  to  the  redistribution,  a  particular  node 
already  contains  more  than  half  of  the  total 
weight  of  an  individual  datum,  then  the  datum  is 
assigned  to  that  node  -  provided  that  such  an  as¬ 
signment  does  not  violate  the  total  node  weight 
restrictions  of  point  2. 

4.  Unassigned  data  (i.e.,  data  without  tracks)  are 
assigned  to  nodes  in  a  simple  ‘Card  Dealing’  fash¬ 
ion. 

These  rules  are  easily  implemented  by  means  of  a  few 
simple  channel  loops  of  the  form  discussed  above. 
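
The four rules can be illustrated by a centralized
sketch. It is not the channel-loop implementation
mentioned in the text: the function name, the input
layout, the `slack` factor used to define the per-node
weight limit and the heaviest-first fallback for the
remaining data are all assumptions.

```python
def assign_data_to_nodes(track_counts, n_nodes, slack=1.1):
    """Choose an owner node for every datum.

    track_counts[datum][node] is the number of tracks ending on that
    datum which currently reside on that node.
    Returns (owner, loads): owner maps datum -> node, loads is the
    total Weight assigned to each node.
    """
    # Rule 1: Weight = total number of tracks ending at the datum.
    weights = {d: sum(counts) for d, counts in track_counts.items()}
    # Assumed per-node limit: the average weight times a slack factor.
    limit = slack * sum(weights.values()) / n_nodes
    loads = [0] * n_nodes
    owner, pending = {}, []

    # Rule 3: keep a datum where more than half of its weight already
    # sits, provided the node stays within the limit.
    for d, counts in track_counts.items():
        w = weights[d]
        if w == 0:
            continue
        node = max(range(n_nodes), key=lambda k: counts[k])
        if 2 * counts[node] > w and loads[node] + w <= limit:
            owner[d] = node
            loads[node] += w
        else:
            pending.append(d)

    # Rule 2: place the rest, heaviest first, on the lightest node.
    for d in sorted(pending, key=lambda d: -weights[d]):
        node = loads.index(min(loads))
        owner[d] = node
        loads[node] += weights[d]

    # Rule 4: deal data without tracks to the nodes in turn.
    empties = [d for d in track_counts if weights[d] == 0]
    for i, d in enumerate(empties):
        owner[d] = i % n_nodes
    return owner, loads
```

Rule 3 exists to reduce traffic: a datum whose tracks
mostly sit on one node stays there, so only the
minority of its tracks has to be routed.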

4  Conclusion 

The hard part of ‘Concurrent Tracking’ is the tracking
itself, not the concurrency.
Abstract

This  paper  describes  a  new  algorithm  for 
the  synchronization  of  a  class  of  parallel 
discrete  event  simulations  on  distributed 
memory,  parallel  computers.  Unlike 
previous  algorithms  which  synchronize 
on  a  per  process  basis,  this  algorithm 
synchronizes  on  a  per  processor  basis. 
The  algorithm  allows  full  generality  in 
the  simulation  model  by  allowing  dynamic 
process  creation  and  destruction  and  full 
inter-process  interconnections,  and  it  is 
shown  to  be  deadlock  and  livelock  free.  It 
has  been  used  to  simulate  very  large 
parallel  computer  architectures. 

This  algorithm  has  been  implemented  on 
the  Intel  iPSC/2  parallel  computer  system, 
and  its  performance  has  been  measured. 
The algorithm achieves a time overhead of
O(log2 N) for binary hypercube systems
with N processors and O(D) in general,
where D is the diameter of the parallel
system. In order to obtain good overall
parallel speedup, the algorithm requires
that the simulation generate at least O(N)
events at each simulated time.

Introduction 

As discrete event simulations increase in
size and complexity, it becomes
advantageous to execute them on parallel
architectures. Parallel architectures offer
the processing power to execute multiple
processes concurrently, thus speeding up
the simulation. In addition, they provide
sufficient physical memory to hold large
simulation models without suffering the
delays required to access backing store in
virtual memory systems.

In order to avoid bottlenecks to parallel
speedup [1], parallel synchronization
algorithms have been developed. These
algorithms distribute the event
scheduling algorithm among the parallel
system's processes, which eliminates the
hot spot in memory access patterns
arising from the use of a centralized event
scheduler, a problem that occurs in both
shared and distributed memory parallel
computers. Two principal classes of
parallel synchronization algorithms have
emerged. The conservative approaches
[1],[2],[3] constrain the processes to handle
incoming events in strict time order.
These algorithms take steps to ensure time
order before processing events. The
optimistic approach [4] allows events to be
handled in their order of arrival and
provides rollback and recovery
mechanisms to handle events processed
out of order in time.

Algorithms  developed  for  both  approaches 
distribute  the  synchronization  algorithm 
on  a  per  process  basis,  thereby  allocating 
one  logical  clock  and  event  scheduler  to 
each  process  in  the  simulation  model.  By 
so  doing,  these  algorithms  incur  a  total 
time  and  memory  overhead  to  advance  all 
process  clocks  one  time  interval  that  is  at 
least  proportional  to  the  number  of 
processes,  P,  in  the  system;  denote  this 
overhead  the  synchronization  overhead. 
Conservative  algorithms  may  suffer  time 
overheads  much  greater  than  P, 
depending  on  the  connectivity  of  the 
inter-process  communication  graph.  This 
occurs because each process must
determine the state of all processes that
may send it an event message before it can
safely advance its time. Hence, in a system
with N parallel processors, these
algorithms incur a time overhead of at
least O(P/N) to advance all P clocks one
time interval each.

Because of the practical need to maximize
the number of processes executed by each
processor in current systems, this
synchronization time overhead typically
becomes at least O(N). With today’s
parallel computer technology, it is
important to maximize the problem size
per processor in order to amortize the
interprocessor communication delays and
thus maximize the observed speedup
[5],[6]. In systems with several megabytes
of memory per processor, such as the Intel
iPSC/2 system, and typical process memory
requirements of 2-6K bytes, one can
execute multiple thousands of processes
per processor (i.e., P/N > 1000). Thus, for
systems with up to a thousand processors
(i.e., N <= 1000) these simulations will
encounter a time overhead of at least O(N).

This  paper  describes  a  new  parallel 
synchronization  algorithm  using  multiple 
synchronized  event  schedulers,  one  per 
processing  node  of  the  system.  The  use  of 
multiple  event  schedulers  was  described 
for  the  Yaddes  simulation  environment 
[7].  However,  the  Yaddes  mechanism  uses 
a  centralized  synchronizer  and  relies 
upon  knowledge  of  the  interconnection 
between  logical  processes.  It  also  requires 
that N * (N-1) messages be sent per time
step, which results in an overhead of at
least O(N) to advance all P clocks by one
time step. In contrast, the algorithm
described here has been demonstrated to
have a synchronization overhead of
O(log2 N) on hypercube based parallel
systems. It will in general have an
overhead of O(D), where D is the diameter
of the parallel system. The diameter is
defined here as the maximum path length
between the root and leaf nodes within a
spanning tree of all processors; it
represents the number of time steps
required to either broadcast from one
processor or aggregate data from all
processors.

The  next  section  describes  the  algorithm. 
The  following  section  provides  recent 
performance  measurements  of  the 
algorithm's  time  overhead.  This  algorithm 
has  been  implemented  as  part  of  the 
Interwork  II™  software  package  [8]  on  the 
Intel  iPSC/2  parallel  computer  system  [9]. 
It  has  been  used  to  simulate  very  large 
parallel  computer  architectures  [10]. 

The  Synchronization  Algorithm 

The  synchronization  algorithm  described 
here  uses  one  event  scheduler  per 
processor  of  the  system.  All  of  the 
processes  residing  on  a  given  processor 
share  the  same  event  scheduler,  which 
consists  of  a  logical  clock  and  a  time 
ordered  queue  of  pending  events.  These 
event  schedulers  are  synchronized  in  a 
conservative  manner  so  that  no  event 
scheduler  may  advance  its  clock  until  it 
can  be  sure  that  no  other  event  scheduler 
will  send  one  of  its  processes  a  lower  time 
event. 

In  order  to  allow  full  generality  to  the 
simulation  model,  the  event  schedulers  are 
fully  connected.  That  is,  processes  within 
the  simulation  can  send  messages  to 
arbitrary  other  processes.  This  eliminates 
the  need  to  describe  a  static  inter-process 
communication  graph,  as  in  other 
conservative  algorithms,  and  it  allows 
processes  to  be  dynamically  created  and 
destroyed.  These  characteristics  greatly 
simplify  the  description  of  complex 
applications.  Full  interconnection  of 
event  schedulers  also  simplifies  the 
partitioning  of  processes  to  processors.  A 
process  can  be  allocated  to  any  processor, 
and  it  may  send  an  event  message  to  any 
other  process  in  the  system. 

The  use  of  full  connectivity  between  event 
schedulers  requires  that  all  event 
schedulers  be  synchronized  to  the  same 
value  of  global  system  time.  The 
algorithm  synchronizes  the  event 
schedulers  using  multiple,  distributed 
spanning trees, one rooted at each event
scheduler,  to  collect  and  distribute  clock 
times  in  parallel  and  with  minimum  total 
delay.  The  event  schedulers  initialize 
their  clocks  to  time  zero  and  execute  their 
portion  of  the  simulation  model's 
processes  until  they  complete  all  events  at 
this  time.  After  each  event  scheduler 
becomes  quiescent,  it  exchanges  its  next 
event  time,  i.e.,  the  lowest  time  at  which  it 
can  next  process  an  event  or  infinity  if  no 
events  are  enqueued,  with  each  of  its 
neighbors  in  turn.  Each  scheduler 
determines  the  minimum  of  all  collected 
and its own next event times. In this
manner, all event schedulers
determine in parallel the globally
minimum next event time. The event
schedulers  advance  their  clocks 
accordingly  and  execute  all  events  at  the 
new  time.  The  algorithm  repeats  these 
actions  for  each  event  time. 
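
The synchronization loop can be sketched as a
sequential simulation. This is not the Interwork II
implementation: all names are hypothetical, the
neighbor-by-neighbor exchange is modeled as a hypercube
dimension exchange, `handler(process, time)` is assumed
to return a list of `(new_time, target_process)` pairs
with `new_time >= time`, and `owner_of(process)` is
assumed to map a process to its processor.

```python
import heapq
import math


def global_minimum(local_times):
    """Every node learns the minimum in log2(N) pairwise exchanges."""
    n = len(local_times)
    dim = n.bit_length() - 1
    if n < 1 or (1 << dim) != n:
        raise ValueError("number of nodes must be a power of two")
    times = list(local_times)
    for channel in range(dim):
        times = [min(times[node], times[node ^ (1 << channel)])
                 for node in range(n)]
    return times


def simulate(n_nodes, initial_events, handler, owner_of):
    """Run one event scheduler per node under a single global clock.

    Returns the trace of executed events as (time, node, process).
    """
    queues = [[] for _ in range(n_nodes)]  # time-ordered event queues
    seq = 0                                # tie-breaker for the heaps

    def post(time, process):
        nonlocal seq
        heapq.heappush(queues[owner_of(process)], (time, seq, process))
        seq += 1

    for time, process in initial_events:
        post(time, process)

    trace = []
    while True:
        # Each quiescent scheduler reports its next event time.
        local_next = [q[0][0] if q else math.inf for q in queues]
        now = global_minimum(local_next)[0]
        if now == math.inf:
            break  # no events left anywhere in the system
        # Execute all events at `now`, including ones posted during it.
        busy = True
        while busy:
            busy = False
            for node, q in enumerate(queues):
                while q and q[0][0] == now:
                    _, _, process = heapq.heappop(q)
                    trace.append((now, node, process))
                    for new_time, target in handler(process, now):
                        if new_time < now:
                            raise ValueError("event scheduled in the past")
                        post(new_time, target)
                    busy = True
    return trace
```

The inner `busy` loop stands in for the requirement
that a scheduler is not treated as quiescent while an
event message at the current time may still be
delivered to it. A real distributed implementation
needs an explicit protocol for this.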

By  synchronizing  all  event  schedulers  to 
the  same  global  time,  the  algorithm 
ensures  that  no  process  can  send  a  lower 
time  event  to  another  process  in  the 
system.  This  guarantees  the  correctness 
of  the  algorithm.  The  algorithm  is 
deadlock  free  because  the  system  is 
guaranteed  to  advance  to  a  new  time  after 
all  event  schedulers  exhaust  their  events 
at  the  current  time,  which  they 
eventually  must  do.  It  is  also  livelock  free 
because  the  sequence  of  globally 
minimum  next  event  times  is  strictly 
increasing.  A  next  event  time  of  infinity 
can only be reached when there are no
events  in  the  system  left  to  process. 

The  algorithm's  relative  simplicity  makes 
it  straightforward  to  implement.  However, 
implementations  of  this  algorithm  must 
handle  the  race  condition  that  occurs 
when  a  process  sends  an  event  message  to 
a  process  at  another  event  scheduler  just 
before  the  first  process's  event  scheduler 
becomes  quiescent.  If  the  second  event 
scheduler  has  already  become  quiescent 
and  reported  its  next  event  time,  its 
neighbor(s)  could  determine  an  erroneous 
globally  minimum  next  event  time  while 
an  active  event  message  at  the  current 
time is in transit. The sending process
must report to its event scheduler that the
system is still active even though its local
event queue has become empty.

Performance  of  the  Algorithm 

The use of a spanning tree to collect next
event times incurs a time overhead of
O(D), where D is the diameter of the
system. The broadcast of the globally
lowest next event time requires a second
O(D) time. Thus the total time overhead to
advance the clock is O(D). For binary
hypercube systems, such as the Intel
iPSC/2 system,

D = log2 N.

The actual performance of the algorithm
was measured on an Intel iPSC/2 system
with one to 32 processors. Figure 1
shows the average time to advance the
clock one clock interval versus the
number of processors for an early
implementation of the algorithm. In
order  to  measure  only  the 
synchronization  overhead,  the  event 
schedulers  handled  no  actual  events  per 
time  step.  The  graph  shows  that  the  time 
grows  logarithmically  with  the  number  of 
processors.  By  scaling  the  simulation 
model  to  maintain  a  constant  processor 
load  as  processors  are  added,  the  algorithm 
thus  becomes  more  efficient  for  larger 
systems  and  larger  simulation  models.  The 
algorithm's time overhead remains
constant as more memory and thus more
processes are added for each processor,
and its fraction of the total simulation time
becomes smaller. Unlike previous
conservative  and  optimistic  approaches, 
the  algorithm’s  performance  is  also 
largely  independent  of  the  interprocess 
communication  patterns.  A  dependence 
may  occur  in  handling  the  previously 
cited  race  condition,  whose  frequency 
depends  on  the  communication  patterns, 
load  balance,  and  system  delays. 


Figure  1:  Time  synchronization  overhead 
per  time  step  versus  number  of  processors 


This  synchronization  algorithm  yields 
best  performance  for  a  limited  class  of 
discrete  event  simulation  models.  In  order 
to  obtain  good  overall  parallel  speedup, 
the  algorithm  requires  that  the  simulation 
generate at least O(N) events at each
simulated time and that these events be
well load balanced across the processing
nodes. Symmetric, discrete time
simulations, such as the simulation of
distributed memory architectures or VLSI
circuits,  are  well  suited  to  this  scheduling 
mechanism.  Simulations  which  do  not 
meet  these  criteria  may  achieve  better 
overall  speedup  using  previously 
described  schemes,  despite  their  higher 
synchronization  overhead. 

Summary 

This  paper  has  described  a  new  algorithm 
for  the  synchronization  of  parallel 
discrete  event  simulations  on  distributed 
memory,  parallel  computers.  Unlike 
previous  algorithms  which  synchronize 
on  a  per  process  basis,  this  algorithm 
synchronizes  on  a  per  processor  basis. 
This  is  accomplished  by  grouping 
processes  within  one  event  scheduler  per 
processor  and  then  synchronizing  the 
event  schedulers  u^^ng  multiple, 
distributed  spanning  trees.  The  algorithm 
allows  full  generality  in  the  simulation 
model  by  allowing  dynamic  process 
creation and destruction and full
interprocess interconnections, and it was
shown to be deadlock and livelock free. It
has been used to simulate very large
parallel computer architectures.

This  algorithm  has  been  implemented  on 
the  Intel  iPSC/2  parallel  computer  system, 
and  its  performance  has  been  measured. 
The  algorithm  achieves  a  time  overhead  of 
O(log2 N) for binary hypercube systems
with N processors and O(D) in general,
where  D  is  the  diameter  of  the  parallel 
system. 
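
The logarithmic bound for a hypercube can be illustrated with a small sketch. It is not the algorithm of this paper: it assumes a dimension-exchange minimum reduction, which behaves like N overlapping spanning trees, and it simulates the message rounds sequentially on one machine. The names global_min_time and local_times are hypothetical.

```python
def global_min_time(local_times):
    """Simulate a dimension-exchange min-reduction on a binary hypercube.

    local_times[i] is the next event time known to node i; the number of
    nodes must be a power of two. Returns (per-node results, rounds).
    """
    n = len(local_times)
    if n == 0 or n & (n - 1):
        raise ValueError("node count must be a power of two")
    times = list(local_times)
    rounds = 0
    bit = 1
    while bit < n:
        # Each node exchanges its value with the neighbour across one
        # hypercube dimension and keeps the smaller of the two.
        times = [min(times[i], times[i ^ bit]) for i in range(n)]
        bit <<= 1
        rounds += 1
    return times, rounds


times, rounds = global_min_time([7.0, 3.5, 9.0, 4.0, 8.0, 6.0, 5.0, 10.0])
print(times[0], rounds)  # 3.5 3
```

With eight nodes every node learns the global minimum after three exchange rounds, one per hypercube dimension, which matches the log2 N growth quoted above.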

In  order  to  obtain  good  overall  parallel 
speedup,  this  synchronization  algorithm 
requires  that  the  simulation  generate  at 
least O(N) events at each time step and that
these  events  be  well  load  balanced  across 
the  processing  nodes.  Symmetric,  discrete 
time simulations, such as the simulation of
distributed  memory  architectures  or  VLSI 
circuits,  are  well  suited  to  this  scheduling 
mechanism.  Future  investigations  will 
assess  the  applicability  of  this  algorithm 
to  continuous  time  simulations,  such  as  the 
simulation  of  air  traffic  [11].  In  these 
simulations,  it  may  be  possible  to  quantize 
the  simulation  clock  without  adversely 
affecting  the  accuracy  of  the  results. 
Abstract

A  discrete  event  simulation  model  of  air 
traffic  flow  within  the  United  States  has 
been  written  and  executed  on  the  Intel 
iPSC®/2  parallel  system.  The  simulation 
program  was  written  in  an  object  oriented 
manner  using  the  Interwork  II™ 
Concurrent  Programming  Toolkit.  This 
simulation  demonstrates  how  object 
oriented  programming  can  simplify  the 
design  of  complex  simulations  and  can 
simplify  the  effort  to  distribute  and 
balance  the  processing  load  on  distributed 
memory,  parallel  architectures,  such  as 
the  iPSC/2.  It  also  demonstrates  the 
capacity  of  these  architectures  to  solve 
very  large  simulation  problems. 

Introduction 

Discrete  event  simulation  techniques  can 
be  used  to  model  large  physical  systems 
and  thereby  predict  aspects  of  their 
behavior.  An  important  example  of  a 
large  discrete  event  simulation  is  the 
modeling  of  the  air  traffic  flow  within  the 
United  States.  Over  three  thousand 
commercial  flights  per  hour  cross  the 
skies,  and  their  motions  must  be  carefully 
controlled  to  avoid  congestion.  Modeling 
this  complex  system  allows  planners  to 
minimize  delays,  maximize  use  of  available 
resources  (such  as  airways,  runways,  and 
fuel),  and  anticipate  the  effects  of  weather 
and  mechanical  problems. 

Distributed memory, parallel systems, such
as the Intel iPSC/2 parallel system [1],
provide an excellent hardware platform
for running large discrete event
simulations. They have sufficient
primary  memory  to  hold  very  large 
simulation  models;  this  avoids  the  delays 
encountered  in  paging  simulation  data  to 
and  from  backing  store.  They  also  have 
the  parallel  processing  power  to  exploit 
the  inherent  parallelism  of  the  simulation 
models.  For  example,  the  many  aircraft 
within  an  air  traffic  simulation 
independently  progress  along  their  flight 
paths  in  parallel.  This  provides  the 
opportunity to speed up the execution of
the  simulation. 

Object  oriented  programming  techniques 
offer  advantages  in  describing  discrete 
event  simulation  models.  These 
techniques  foster  the  construction  of 
modular  programs  by  associating  logically 
related  data  with  the  procedures  that 
manipulate  the  data.  This  allows  the 
programmer  to  build  new  data  types  that 
extend  the  set  of  types  provided  by  the 
compiler.  In  a  simulation  model,  the 
simulated  entities  are  conveniently 
described as object types which
encapsulate  their  various  characteristics. 
In  the  air  traffic  simulation,  the  aircraft 
and  airports  are  modeled  as  separate  object 
types.  Instances  of  these  types  are 
dynamically  created  (and  destroyed) 
during  the  course  of  a  simulation  to 
represent  an  actual  physical  system.  This 
modular  approach  simplifies  the 
simulation's  structure. 

Object  oriented  techniques  also  simplify 
the  implementation  of  large  discrete  event 
simulation  models  on  parallel  systems. 
The  modular  decomposition  of  the 
simulation  data  provides  a  basis  for 
partitioning these data among the system's
processing  nodes.  The  name  space  by 
which  objects  are  identified  and  accessed 
provides  a  logical  basis  for  communication 
between  objects  that  is  independent  of  the 
objects'  locations  within  the  physical 
system.  This  maintains  the  simple  view  of 
the  simulation  program  as  a  collection  of 
communicating  entities.  It  also  provides  a 
means  for  transparently  relocating  the 
objects  among  the  nodes  (for  example,  to 
improve  the  load  balance)  without 
disrupting  the  programmer's  view. 

The  remainder  of  this  paper  describes  how 
the  air  traffic  simulation  was  modeled  on 
the  Intel  iPSC/2  system  using  object 
oriented  techniques.  The  next  section 
describes  the  object  oriented  simulation 
model.  The  following  section  describes  the 
implementation  of  this  model  on  the 
iPSC/2  system. 

The  Air  Traffic  Simulation  Model 

The  air  traffic  simulation  program  models 
the  motion  of  aircraft  between  source  and 
destination  airports  through  intervening 
sectors  of  controlled  air  space.  The 
simulation  model  uses  three  main  object 
types: 

airplane,  which  models  the  position, 
velocity,  and  other  characteristics  of 
aircraft, 

airport,  which  models  the  location, 
characteristics  (such  as  runway  headings 
and  lengths)  and  access  to  air  traffic 
control  for  takeoff  and  landing,  and 

sector,  which  models  a  portion  of  the 
airspace  and  the  data  required  to  manage 
the  aircraft  flying  therein. 

Instances of these object types are created
to  model  their  physical  counterparts.  In 
addition,  three  process  types  are  used  in 
the  air  traffic  model: 

pilot,  which  commands  the  actions  of  a 
single  aircraft  throughout  a  flight, 


air traffic controller, which controls
access to airport runways for takeoff and
landing and provides separation between
aircraft in the airspace sectors, and

dispatcher,  which  schedules  new  flights 
for  departure  at  each  airport. 

Following the Hoare process-monitor
model [2] for communicating processes,
instances of these process types
communicate and synchronize their
actions by accessing instances of the
airplane, airport, and sector object types.
The  latter  three  types  serve  as  monitors  in 
this  communication  model.  Figure  1  shows 
the  access  relationships  between  the 
process  and  monitor  objects;  a  directed  arc 
in  the  graph  from  a  process  type  to  a 
monitor  type  indicates  that  an  instance  of 
the  process  type  accesses  an  instance  of 
the  monitor  type. 


Figure  1:  Access  Relationships  Between 
Simulation  Objects 

The  use  of  a  process-monitor  model 
clarifies  the  communication  relationships 
between  the  simulation  objects  and 
naturally  models  the  physical  system.  For 
example,  a  pilot  requests  permission  to 
land  at  an  airport  by  accessing  the  airport 
object,  which  relays  the  request  to  an  air 
traffic  controller  at  the  airport.  This 
approach  models  a  pilot's  targeting 
communication  to  an  airport  instead  of  to 
a  particular  controller  at  the  airport.  It 
also  easily  allows  the  simulation  to 
accommodate  multiple  air  traffic 
controllers  at  the  airport.  Other 
simulation  techniques  (e.g.,  Misra  [3], 
Time  Warp  [4])  model  all  simulated  entities 
as  reactive  objects  which  receive 
incoming  messages  and  invoke  the 
appropriate  procedures,  which  in  turn 
generate  messages  for  other  objects.  The 
communication  between  processes  and 
monitors can be cast into this form.
However,  the  use  of  processes  appears  to 
more  clearly  express  the  temporal 
behavior  of  active  simulation  entities 
(e.g.,  pilots)  by  collecting  their  sequence 
of  actions  into  one  thread  of  execution. 

The  initiation  of  a  simulated  flight  is 
accomplished  by  dynamically  creating 
objects  and  invoking  their  interface 
procedures.  After  a  random  interval  and 
following  a  closed  queueing  model,  the 
dispatcher  process  at  an  airport  creates  an 
airplane object and a pilot process to
control  it.  The  dispatcher  passes 
parameters  to  the  pilot  process  identifying 
the  airplane  object  and  the  destination. 
The  pilot  process  computes  the  course  to 
the  destination  and  then  requests  takeoff 
permission  by  invoking  the 
Request_ctlr()  procedure  of  the  local 
airport  object.  The  airport  object  records 
the  request  for  the  controller,  and  the 
pilot  process  blocks  awaiting  a  reply.  The 
airport’s  air  traffic  controller  process 
obtains  a  request  by  invoking  the 
Deq_request()  procedure  of  the  airport 
object.  After  a  simulated  delay,  it  grants 
the  request  by  invoking  the 
Reply_to_plane()  procedure  of  the 
requesting  pilot's  airplane  object.  The 
pilot  process  unblocks  and  directs  the 
airplane  toward  the  destination  airport  by 
invoking  the  Change_velocity() 
procedure  of  its  airplane  object. 
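
The takeoff handshake described above can be sketched with two threads and two monitor objects. This is an illustration only, not Interwork II code: Python threads stand in for lightweight processes, the method names are lower-case versions of the procedures named in the text, await_reply() is a hypothetical helper for the pilot's blocking wait, and the simulated controller delay is omitted.

```python
import queue
import threading


class Airplane:
    """Monitor holding an aircraft's state and a reply slot for its pilot."""

    def __init__(self, name):
        self.name = name
        self.velocity = (0.0, 0.0)
        self._reply = threading.Event()

    def reply_to_plane(self):
        # Called by a controller to grant the pending request.
        self._reply.set()

    def await_reply(self):
        # Hypothetical helper: the pilot blocks here until a reply arrives.
        self._reply.wait()
        self._reply.clear()

    def change_velocity(self, velocity):
        self.velocity = velocity


class Airport:
    """Monitor that queues pilot requests for the airport's controller."""

    def __init__(self):
        self._requests = queue.Queue()

    def request_ctlr(self, airplane):
        self._requests.put(airplane)

    def deq_request(self):
        return self._requests.get()  # blocks until a request is queued


def pilot(airport, airplane, velocity, log):
    airport.request_ctlr(airplane)      # ask for takeoff permission
    airplane.await_reply()              # block until it is granted
    airplane.change_velocity(velocity)  # head for the destination
    log.append(("departed", airplane.name))


def controller(airport, log):
    airplane = airport.deq_request()
    log.append(("granted", airplane.name))
    airplane.reply_to_plane()


def run_takeoff():
    airport, plane, log = Airport(), Airplane("N100"), []
    threads = [
        threading.Thread(target=controller, args=(airport, log)),
        threading.Thread(target=pilot, args=(airport, plane, (450.0, 0.0), log)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return plane, log
```

Because the pilot addresses the airport rather than a particular controller, further controller threads could be started against the same Airport object without changing the pilot code.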

The  pilot  process  simulates  a  flight  by 
sleeping  for  the  simulated  time  required  to 
reach  the  destination  airport.  (Note  that 
in  other  simulation  models,  the  pilot 
process  could  equivalently  send  itself  a 
message  at  the  future  time  of  arrival.) 
This  is  accomplished  by  invoking  the 
Sleep_flight()  procedure  on  the  local 
sector  object,  i.e.,  the  sector  object  in 
whose  area  the  departing  airport  is 
located.  This  procedure  causes  the  pilot 
process  to  sleep  until  it  reaches  the 
destination.  At  this  time,  the  pilot  process 
requests  permission  to  land  at  the 
destination  airport;  the  sequence  of  object 
invocations  follows  the  sequence  used  to 
depart.  Takeoff  and  landing  delays  are 
measured  and  averaged  for  all  airports. 


The  Sleep_flight()  procedure  encapsulates 
the  actions  required  to  move  the  aircraft 
from  sector  to  sector  of  the  airspace  until 
it reaches its destination. It returns
control to the pilot process at an earlier
time only if the aircraft comes into
conflict with another aircraft and must
change  its  course.  This  design  allows  the 
pilot  process  to  simulate  the  actions  of  a 
real  pilot  in  managing  changes  of  course, 
while  making  the  handoffs  between 
airspace  sectors  transparent  for 
simplicity. 

The  sector  objects  manage  the  progress  of 
flights  across  the  airspace.  Each  sector 
object  represents  a  particular  portion  of 
the  air  space  and  has  an  associated 
controller  process.  Pilots  enter  their 
aircraft  into  the  local  sector  by  executing 
the  Sleep_flight()  procedure.  This 
procedure  enqueues  the  flight  for 
examination  by  the  sector  controller  and 
blocks  the  calling  pilot  process.  The 
sector  controller  process  repeatedly 
dequeues  requests  in  its  sector  by 
invoking the Get_next_request()
procedure  on  its  sector  object.  The 
controller  records  the  flight  in  its  list  of 
aircraft  operating  in  the  local  sector.  It 
also computes the earliest time at which the
aircraft will either depart the sector or come into
conflict with another aircraft within the
sector.  The  controller  then  unblocks  the 
pilot  process,  which  sleeps  until  this  time. 
If  the  process  awakens  due  to  exiting  the 
sector,  it  invokes  sector  procedures 
required  to  remove  it  from  the  current 
sector  and  enqueue  it  into  the  next  sector 
along  its  path.  If  it  awakens  due  to  a 
traffic  conflict,  control  returns  to  the 
pilot  process  to  take  the  necessary  actions. 
(Pilots  do  not  take  evasive  action  in  the 
current  implementation  of  the 
simulation.) 
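
The wake-up time computed by the sector controller can be sketched as follows. The paper does not give the geometry it uses, so this example assumes rectangular sectors, straight-line flight at constant velocity, and a conflict defined as two aircraft coming within a fixed separation distance; all function and parameter names are hypothetical.

```python
import math


def exit_time(pos, vel, lo, hi):
    """Time until a point moving at constant velocity leaves the box [lo, hi]."""
    t = math.inf
    for p, v, a, b in zip(pos, vel, lo, hi):
        if v > 0:
            t = min(t, (b - p) / v)
        elif v < 0:
            t = min(t, (a - p) / v)
    return t


def conflict_time(pos_a, vel_a, pos_b, vel_b, min_sep):
    """Earliest time >= 0 at which two aircraft come within min_sep, else inf."""
    dx = [b - a for a, b in zip(pos_a, pos_b)]
    dv = [b - a for a, b in zip(vel_a, vel_b)]
    # Solve |dx + dv * t|^2 = min_sep^2, a quadratic in t.
    a = sum(v * v for v in dv)
    b = 2.0 * sum(x * v for x, v in zip(dx, dv))
    c = sum(x * x for x in dx) - min_sep ** 2
    if c <= 0:
        return 0.0       # already closer than the minimum separation
    if a == 0:
        return math.inf  # identical velocities: separation never changes
    disc = b * b - 4.0 * a * c
    if disc < 0 or b >= 0:
        return math.inf  # the aircraft never close to within min_sep
    return (-b - math.sqrt(disc)) / (2.0 * a)


def wake_time(now, plane, others, lo, hi, min_sep):
    """Simulated time at which the pilot process should next awaken."""
    pos, vel = plane
    t = exit_time(pos, vel, lo, hi)
    for other_pos, other_vel in others:
        t = min(t, conflict_time(pos, vel, other_pos, other_vel, min_sep))
    return now + t


plane = ((10.0, 50.0), (5.0, 0.0))
oncoming = ((60.0, 50.0), (-5.0, 0.0))
print(wake_time(100.0, plane, [oncoming], (0.0, 0.0), (100.0, 100.0), 5.0))  # 104.5
```

In the example the aircraft would leave the sector after 18 time units, but the oncoming flight closes to the separation limit after 4.5, so the conflict determines the wake-up time.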

The  use  of  a  sector  controller  models  the 
actions  of  its  physical  counterpart.  It  also 
avoids  the  potential  deadlock  that  can 
arise  if  multiple  pilot  processes  access 
airplane  objects  in  order  to  detect 
conflicts  and  atomically  update  the 
airplane  objects'  states. 
Implementation on the Intel iPSC/2
System 

The  air  traffic  simulation  was 
implemented  on  the  Intel  iPSC/2  system 
using  the  Interwork  II  Concurrent 
Programming  Toolkit  [5].  This  C  language 
toolkit  provides  a  global  object  name  space 
spanning  the  system's  nodes  in  which 
objects  can  be  dynamically  created, 
destroyed,  and  accessed  [6].  This  object 
name  space  was  used  to  create  instances  of 
the  simulation  object  types  described 
above  (i.e.,  airplane,  airport,  and  sector). 
Interwork II provides a built-in object
type  for  lightweight  processes;  this  was 
used  to  create  the  simulation  processes 
(i.e.,  the  dispatchers,  pilots,  and 
controllers)  as  instances  of  the 
lightweight  process  type. 

The  use  of  a  global  object  name  space 
greatly  simplified  the  implementation  of 
this  simulation  on  a  distributed  memory, 
parallel  system.  The  simulation  objects 
can  be  directly  referenced  to  invoke  their 
interface  procedures  by  using  their  global 
names.  For  example,  a  pilot  process  can 
request  permission  to  land  at  an  airport  by 
knowing  only  the  airport's  name.  Without 
a  global  name  space,  the  pilot  would  need 
to  know  the  processing  node  on  which  the 
airport  object  resides  and  would  have  to 
send  a  message  to  communicate  with  a 
remote  airport.  Interwork  II's  global 
name  space  transparently  locates  objects 
within  the  system  and  generates  messages 
as  necessary  to  invoke  their  procedures 
on  remote  nodes.  In  addition,  the  airplane 
objects  transparently  migrate  from  node 
to  node  as  the  sector  controllers  access 
their  contents.  This  allows  the  controllers 
to  atomically  update  the  states  of  two 
airplanes  (to  re-vector  them  in  case  of  a 
conflict)  without  regard  to  the  nodes  on 
which  the  airplanes  reside. 
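
The effect of a global object name space can be illustrated with a toy registry. This is not the Interwork II interface, which is not shown in this paper; the class GlobalNameSpace, its methods, and the object name "airport/A" are hypothetical, and a remote invocation is represented only by a message counter.

```python
class GlobalNameSpace:
    """Toy registry mapping global object names to (node, object) pairs."""

    def __init__(self):
        self._objects = {}
        self.messages_sent = 0

    def create(self, name, obj, node):
        self._objects[name] = (node, obj)

    def migrate(self, name, new_node):
        _, obj = self._objects[name]
        self._objects[name] = (new_node, obj)

    def invoke(self, caller_node, name, method, *args):
        node, obj = self._objects[name]
        if node != caller_node:
            # A real system would marshal the call into a message here.
            self.messages_sent += 1
        return getattr(obj, method)(*args)


class Airport:
    def __init__(self):
        self.pending = []

    def request_ctlr(self, flight):
        self.pending.append(flight)
        return len(self.pending)


ns = GlobalNameSpace()
ns.create("airport/A", Airport(), node=3)
count = ns.invoke(0, "airport/A", "request_ctlr", "flight-1")  # remote call
ns.migrate("airport/A", new_node=0)
count = ns.invoke(0, "airport/A", "request_ctlr", "flight-2")  # now local
print(count, ns.messages_sent)  # 2 1
```

The caller uses the same invoke() call before and after the object moves; only the registry knows where the object resides.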

The  use  of  lightweight  processes  allows 
thousands  of  processes  to  be  created, 
which  is  required  for  large  simulations 
with  thousands  of  flights  and  hundreds  of 
airports.  The  use  of  the  global  name  space 
to access the lightweight processes allows
them to be transparently referenced on
remote  nodes.  For  example,  a  sector 
controller  can  unblock  a  pilot  process 
without  knowing  on  which  node  the  pilot 
process  is  located. 

The  parallel  system's  distributed  memory 
architecture  requires  that  the  simulation 
objects  be  partitioned  among  the 
memories  of  the  processing  nodes.  The 
use  of  object  oriented  techniques 
simplifies  this  somewhat  by  encapsulating 
logically  related  data  into  separate  object 
instances.  To  maximize  the  load  balance  of 
the  simulation,  the  objects  are  distributed 
across  the  nodes  according  to  their 
position within the airspace. Thus,
parallelism in the simulation is achieved
using  a  domain  decomposition,  where  the 
domain  is  the  simulated  airspace. 

In  order  to  avoid  manually  placing  the 
objects  on  the  nodes,  an  Interwork  II 
indexed  object  is  used  to  represent  the 
airspace.  An  indexed  object  is  a  collection 
of related objects, which are referenced
within the collection using an
n-dimensional index [7]. Interwork II
automatically  distributes  the  component 
objects  across  the  nodes  so  as  to  best 
balance  the  number  on  each  node  and 
minimize  communication  delays  between 
neighboring  objects.  In  this  simulation, 
an  indexed  object  is  used  to  represent  the 
airspace, and each component object
represents a sector of the airspace. In this
manner,  the  sector  objects  are 
transparently  partitioned  across  the 
processing  nodes.  In  addition,  by 
associating  the  airport  and  dispatcher 
objects  with  their  corresponding  sector 
objects, these other objects are also
transparently partitioned across the
nodes. This method statically balances
the simulation load within the
parallel  system,  which  works  well  when 
the  airports  and  resulting  air  traffic  are 
evenly  distributed  across  the  air  space. 
The  relationship  between  the  sector 
objects  and  their  associated  airport  objects 
is  depicted  in  figure  2.  Future 
implementations  need  to  dynamically 
remap  sectors  to  nodes  to  provide  dynamic 
load balancing for irregular simulation
loads. 


Figure  2:  Relationship  Between  Simulation 
Objects 


Discrete  event  simulations  run  within  a 
common  time  base,  with  which  they 
synchronize  their  actions.  For  example, 
this  time  base  is  used  to  synchronize  pilot 
processes  executing  the  Sleep_flight() 
procedure.  Since  the  iPSC/2  system  has  no 
global  clock,  a  global  software  clock  is 
synthesized  by  Interwork  II  [8].  The 
method  used  by  Interwork  II  has  the 
advantage  of  relative  simplicity  in 
comparison  to  the  Time  Warp  approach  [4]. 
It  also  has  the  flexibility  necessary  to 
capture  the  simulation  model's  dynamic 
object creation, unlike the
Chandy-Misra-Bryant [3] approach. However, in order to
achieve  substantial  parallel  speedup,  this 
method  requires  the  availability  of  many, 
well  distributed  simulation  events  to 
process  at  each  simulated  time.  The 
current  implementation  uses  a  floating 
point  simulation  clock  whose  fine 
granularity  results  in  relatively  few 
events  to  process  at  each  simulated  time. 
Future  implementations  will  use  integral 
clock  periods  with  sufficient  granularity 
to  capture  the  simulation's  behavior.  This 
approach  should  yield  much  greater 
parallelism  for  exploitation  by  the  parallel 
system. 
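
The effect of clock granularity on available parallelism can be seen in a small sketch. The event times and the tick length are made-up values, and events_per_step is a hypothetical helper; the point is only that rounding event times to integral clock periods groups more events at each simulated time.

```python
from collections import Counter


def events_per_step(event_times, tick=None):
    """Count events sharing each simulated time.

    With tick=None the raw floating point times are used; otherwise each
    time is rounded down to a whole number of ticks.
    """
    if tick is None:
        keys = event_times
    else:
        keys = [int(t // tick) for t in event_times]
    return Counter(keys)


times = [0.12, 0.47, 0.93, 1.08, 1.51, 1.77, 2.30, 2.64]
raw = events_per_step(times)
quantized = events_per_step(times, tick=1.0)
print(max(raw.values()), max(quantized.values()))  # 1 3
```

With the floating point clock each event has its own time, so at most one event is available per step; with a tick of 1.0 up to three events can be processed together. Whether the coarser clock preserves the simulation's accuracy is the open question raised above.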

Maximal  parallel  speedup  also  depends  on 
the  distribution  of  the  simulation  events 
across  the  processing  nodes  at  each 
simulated time. Future versions will
explore the use of alternative (e.g.,
interleaved) mappings of the sectors
to the nodes. Beyond this,
dynamic  load  balancing  algorithms  may 
be  needed  to  compensate  for  the  dynamic 
redistribution  of  simulation  events  across 
the  nodes  as  the  simulation  progresses. 
The  use  of  a  global  object  name  space 
enables  simulation  objects  to  be 
transparently  moved  between  nodes  to 
load  balance  the  simulation  without 
changing  the  application  code. 
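
The difference between a contiguous and an interleaved mapping of sectors to nodes can be sketched as below. The mapping functions and the traffic figures are hypothetical and are not the placement policy of Interwork II; the sectors are assumed to form a two-dimensional grid numbered in row-major order.

```python
def block_map(row, col, rows, cols, nodes):
    """Assign contiguous runs of sectors (row-major order) to each node."""
    index = row * cols + col
    per_node = -(-rows * cols // nodes)  # ceiling division
    return index // per_node


def interleaved_map(row, col, rows, cols, nodes):
    """Deal sectors out to the nodes round-robin in row-major order."""
    return (row * cols + col) % nodes


def node_loads(mapping, traffic, nodes):
    """Sum the per-sector traffic that lands on each node under a mapping."""
    rows, cols = len(traffic), len(traffic[0])
    loads = [0] * nodes
    for r in range(rows):
        for c in range(cols):
            loads[mapping(r, c, rows, cols, nodes)] += traffic[r][c]
    return loads


# Made-up traffic: the top row of sectors is far busier than the rest.
traffic = [
    [8, 8, 8, 8],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
print(node_loads(block_map, traffic, 4))        # [32, 4, 4, 4]
print(node_loads(interleaved_map, traffic, 4))  # [11, 11, 11, 11]
```

Interleaving evens out the load in this example, at the cost of placing neighbouring sectors on different nodes, which would be expected to increase communication when flights cross sector boundaries.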

Summary 

This  paper  has  described  a  discrete  event 
simulation  model  of  air  traffic  flow  within 
the  United  States.  This  simulation  model  is 
characterized  by  having  a  very  large 
number  of  simulation  entities  and  by  its 
dynamic  creation  and  destruction  of  these 
entities.  The  simulation  program  was 
written  in  an  object  oriented  manner  for 
the  Intel  iPSC/2  distributed  memory, 
parallel  system  using  the  Interwork  II 
Concurrent  Programming  Toolkit.  The 
iPSC/2  system  has  sufficient  memory  to 
hold  the  very  large  simulation  program, 
and  it  has  the  processing  power  to  exploit 
the  simulation's  parallelism. 

The  use  of  object  oriented  programming 
techniques  led  to  a  very  straightforward 
simulation model. Interwork II's global
object name space, indexed object
paradigm, and global synchronization
mechanism simplified the model's
implementation on the iPSC/2 system. In
particular, they provided the means to
automatically  distribute  the  simulation 
objects  among  the  processing  nodes  and  to 
transparently  access  objects  on  remote 
nodes. 

More  work  is  needed  to  measure  and 
improve  the  parallel  speedup.  In 
particular,  the  use  of  an  integral 
simulation  clock  and  dynamic  load 
balancing  techniques  may  prove  useful  in 
extracting  more  parallelism  from  the 
simulation. 
Abstract

Techniques  for  implementing  an  aircraft 
simulation  and  control  system  on  a  network  of 
transputers  are  described.  Different  parallelisation 
approaches  are  shown  to  be  appropriate  within  each  of 
the  major  constituent  processes.  The  task  of 
atmospheric  turbulence  simulation  is  used  to  illustrate 
the  procedure. 

Introduction 

The Inmos Transputer is a processing element
which allows great flexibility in the design of a
distributed memory concurrent computer. The basic
architecture  of  the  T-800  series,  comprising  32-bit 
CPU,  on-chip  FPU,  4K  bytes  of  RAM,  4  bi-directional 
(20  Mbit/sec)  links,  and  a  32-bit  external  memory 
interface,  is  now  widely  familiar,  and  well  documented. 
The  application  of  transputers  to  simulation  in  the 
aerospace  industry,  however,  lags  behind  systems  which 
are  usually  based  either  on  bus-connected  conventional 
microprocessors,  or  dedicated  shared  memory 
minisupercomputers.  Previously  the  necessity  of 
programming  in  Occam,  or  of  writing  a  dedicated 
Occam  harness  to  handle  inter-processor 
communication,  coupled  with  the  lack  of 
cross-development  tools,  proved  a  stumbling  block 
which  undoubtedly  restricted  the  more  widespread  use  of 
transputers.  The  development  of  parallel  operating 
systems,  and  parallel  versions  of  Fortran,  C,  Pascal  and 
recently  Ada  for  the  transputer,  should  greatly  assist  in 
the  translation  of  existing  codes  and  provide  a  more 
open  environment  for  code  development. 

This  work  is  concerned  with  the  conversion  of  a 
sequential  Fortran  code  for  aircraft  simulation  and 
control  to  run  on  a  transputer  network.  The  problem 
area  provides  a  challenge  for  effective  parallelisation, 
since  the  system  functions  as  an  aggregation  of  very 
disparate  algorithms  working  in  loose  synchronisation. 


The  purpose  of  the  paper  is  to  examine  simple 
parallelisation  strategies  for  the  various  computational 
tasks  which  are  performed  by  the  separate  functional 
units.  The  assessment  of  potential  performance  is 
intended  to  guide  future  work,  which  will  concern 
synthesis  of  the  complete  parallel  system.  The  paper  is 
divided  into  four  sections  as  follows. 

The  main  components  of  the  simulator  are  reviewed  in 
the  first  section. 

The  second  section  discusses  the  use  of  information 
flow  routes,  and  the  amount  of  computation  within 
procedures  to  guide  the  division  of  the  system  into  a  set 
of  communicating  tasks.  The  methods  for  software 
implementation  and  effective  mapping  onto  the 
transputer  network  are  outlined. 

In  the  third  section,  the  simulation  of  atmospheric 
turbulence  is  used  as  an  example  to  illustrate  how  the 
parallelisation  strategies  may  be  applied.  The  problems 
in  achieving  effective  parallelisation  for  a  complex  task, 
and  of  load  balancing  the  parallel  turbulence  generator, 
simulator, and control system are then addressed.

Finally,  in  the  fourth  section,  the  procedures  are 
reviewed,  and  the  conclusions  presented. 

Overview 

The  simulator  used  for  this  work  was  designed  to 
fulfill  two  roles.  Firstly,  it  should  allow  the  handling 
qualities  of  modified  aircraft  configurations  to  be 
studied,  in  both  the  linear  (low  angle  of  attack)  and 
non-linear  (high  angle  of  attack)  regimes.  The  second 
purpose  is  to  assess  the  effectiveness  of  various  control 
techniques  to  alleviate  gust  loads  (due  to  atmospheric 
turbulence),  and  for  the  optimisation  of  advanced 
configurations  in  manoeuvring  flight.  The  principal 
components of the system are outlined in Fig. 1.

The  simulator  (1)  models  the  response  of  the 
aircraft to inputs u from the controller, and d from the
turbulence  field.  At  each  time  step,  the  simulator 
outputs  the  updated  complete  state  of  the  aircraft  to  a 
database. This also holds details of the terrain, other
aircraft, etc., and it may, in turn, be connected to a
graphics display pipeline, via a ring control to which
other simulators can be attached. (The graphics
processing and ring control are not of primary concern
here; an effective parallel solution is documented in the
Inmos applications notebook [1] and their note 36
[2].) The simulator also outputs the position i of the
aircraft  to  a  local  database  attached  to  the  turbulence 
generator. 

The  turbulence  generator  (2)  performs  the  twin 
functions  of  regularly  updating  the  turbulence  field,  and 
of  calculating  the  components  of  the  local  turbulent 
gust  velocity  i ,  when  the  aircraft  is  at  a  point  specified 
by  2t.  (The  variation  in  turbulent  velocity  across  the 
wing  span  introduces  significant  moments.  It  is 
necessary  to  include  this,  either  by  introducing  averaged 
asymmetric  turbulent  velocities,  (by  integrating  the 
turbulent  velocities  across  the  span),  or  for  a  more 
accurate  treatment,  to  utilise  a  three  dimensional 
turbulence  field). 

Finally,  the  controller  (3),  which  also  has  a  local 
database,  accepts  inputs  from  either  the  pilot,  or  a  preset 
trajectory.  It  then  compares  this  with  the  current 
estimate  of  the  aircraft  state  (which  it  computes  from 
the  measured  simulator  output  i),  and  sends  a  control  u 
to  the  simulator. 

The  first,  obvious,  stage  in  parallelism  follows 
from  the  identification  of  the  three  major  components. 
The  use  of  specialised  local  databases,  where  possible, 
clearly  distributes  the  communications  load.  Mapping 
the  major  tasks  onto  a  network  of  distributed 
processors  offers  the  major  challenge,  as  considered 


Parallelisation  Strategies  for  Transputers. 

Most  of  the  literature  on  parallelising  sequential 
code is concerned with multiprocessors of the shared
memory  type,  and  concentrates  on  DO  LOOPS  as  the 
source  of  parallelism.  Nearly  70  references  to  work  in 
this  area  are  listed  in  [3],  which  also  reports  some 
preliminary  experiments  on  the  feasibility  of  loop 
partitioning  across  a  transputer  chain. 

At  present,  there  is  no  efficient,  autoparallelising 
conventional  language  compiler  suitable  for  a 
distributed  memory  concurrent  computer.  As  described 
in  [4],  to  achieve  this  requires  some  information  to  be 
specified  on  the  placement  of  data.  The  present  parallel 
compilers  for  the  transputer  essentially  mimic  the 
underlying  Occam,  in  concentrating  on  the  distribution 
of  tasks,  to  which  the  necessary  inter-task  message 
handling  must  be  added.  Although  improvements  in 
compilers  (particularly  C),  and  operating  systems  are 
being  made,  translation  of  sequential  code  is  still  a 
"hands-on"  approach.  Fortunately  though,  it  is  not 
necessary  (particularly  using  the  farm  approach 
discussed  below)  to  explicitly  specify  every  realisation 
of  a  concurrent  task.  A  knowledge  of  the  relative 
performance  of  the  transputer  at  processing, 
communications,  and  the  speed  ratio  for  on-  versus 
off-chip  RAM  is  however,  obviously  required.  The 
transputer  has  the  ability  to  communicate  across  links, 
whilst  concurrently  executing  a  second  process, 
although  each  link  transfer  in  all  cases  occupies  the 
processor for a "setup" time. Decisions regarding the
optimal  parallelisation  thus  evolve  by  balancing  the 
length  of  data  packets  sent,  the  setup  times,  as  well  as 
the  connection  topology,  and  the  ability  to  utilise 
parallel execution threads, as discussed in [5].

A PC-hosted network of eight T-800 transputers
using  the  3L  parallel  Fortran  compiler  was  used  in  this 
study,  although  future  development  work  and 
implementation  will  use  a  Meiko  Computing  Surface, 
with  an  initial  complement  of  32  T-800  transputers. 

The  techniques  for  parallel  processing  may  be 
grouped into three principal paradigms: the geometric
array, the algorithmic pipe, and the processor farm. It is
chiefly the latter two of these which have been applied
to the part of the work reported here. (The geometric
array  is  clearly  advantageous  for  the  display  processing, 
and  is  also  suitable  for  a  turbulence  generator  more 
complex  than  that  chosen  as  an  illustration  below.) 

Algorithmic  pipelines 

In  the  algorithmic  pipe,  each  task  is  first 
examined  to  establish  the  routes  along  which  the  data 
processing flows, and the connections between routes.
A  single  section  of  data  flow,  between  input  to  the  task, 
and  output  from  the  task  is  then  treated  as  an 
algorithmic  pipeline.  There  are,  however,  two  major 
difficulties  with  the  algorithmic  pipeline  which  have 
been  encountered. 

Firstly,  for  any  given  task,  there  is  a  limit  to 
the  number  of  pipelined  sections  into  which  it  may 
efficiently  be  divided.  As  the  number  of  pipeline 
sections  is  increased,  the  work  done  in  each  section 
decreases, whereas the communication between tasks
tends to increase, since more intermediate results need to
be  transferred.  Consequently  the 
computation/communication  ratio  for  each  section 
diminishes  rapidly.  Although  the  transputer  is  capable 
of  concurrent  computation  and  communication  in 
parallel  threads  once  the  communication  link  engines 
are  started,  a  fixed  time  interval  is  required  to  set  up  the 
link  engines.  The  subdivision  of  each  task  is  relatively 
coarse  grained  as  a  result,  although  since  there  is  a 
large  number  of  tasks,  this  does  not  prove  to  be 
restrictive  in  practice. 
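
The falling computation/communication ratio can be made concrete with a simple cost model. The model and every number in it are assumptions made for illustration, not measurements from this work: the task's work is divided evenly among the sections, each section pays a fixed link setup time plus a transfer time to pass its result on, and no overlap of computation and communication is assumed.

```python
def pipeline_efficiency(work, sections, setup, payload, bandwidth):
    """Efficiency of an evenly divided pipeline under a simple cost model.

    Each section computes work/sections, then pays a fixed link setup
    time plus payload/bandwidth to forward its result (no overlap).
    """
    compute = work / sections
    comm = setup + payload / bandwidth
    speedup = work / (compute + comm)
    return speedup / sections


# Hypothetical figures: times in microseconds, payload in bytes,
# bandwidth in bytes per microsecond.
for k in (1, 4, 10):
    print(k, round(pipeline_efficiency(1000.0, k, 5.0, 90.0, 2.0), 3))
# 1 0.952
# 4 0.833
# 10 0.667
```

Even with a constant payload per section the efficiency falls as sections are added, because the fixed communication cost is set against a shrinking amount of computation; growth in the intermediate results would steepen the decline.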

A  second  problem  is  that  one  or  more  sections 
in  the  pipeline  may  require  far  more  computation  time 
than  other  stages,  leading  to  execution  bottlenecks, 
and  consequent  low  parallel  efficiencies.  Sub-division 
into  more  sections  is  often  not  possible,  for  the  reasons 
above. In this case, possible solutions are:

(i) Use a faster processor (a valid option now that
heterogeneous processor networks, for example T-800
and i860, are available).

(ii) Effectively "widen the pipeline" at the bottleneck,
using an array of transputers.
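The effect of a bottleneck section, and of widening the pipeline at that section, can be illustrated with a very simple model. The sketch below is an idealisation and not the method used in the simulator: it ignores communication cost, and it assumes that a section shared by k processors runs k times faster. The function name and the stage times are hypothetical.

```python
def pipeline_efficiency(stage_times, widths=None):
    """Idealised parallel efficiency of a pipeline.

    stage_times : computation time of each pipeline section per data item
    widths      : number of processors sharing each section (default 1 each)

    Communication cost is ignored, and a section shared by k processors is
    assumed to run k times faster; both are simplifying assumptions.
    """
    if widths is None:
        widths = [1] * len(stage_times)
    effective = [t / k for t, k in zip(stage_times, widths)]
    cycle = max(effective)                 # the slowest section sets the rate
    return sum(stage_times) / (sum(widths) * cycle)

balanced = pipeline_efficiency([1.0, 1.0, 1.0])
bottleneck = pipeline_efficiency([1.0, 4.0, 1.0])
widened = pipeline_efficiency([1.0, 4.0, 1.0], widths=[1, 4, 1])
print(balanced, bottleneck, widened)
```

In this model a single section four times slower than its neighbours halves the efficiency of a three-processor pipe, while giving that section four processors restores it. In practice the link set-up time discussed above reduces the gain.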

Two  examples  taken  from  the  simulator  illustrate 
the  application  of  algorithmic  pipelines  to  the  problem 
area. 


Fig. 2a. Simple Algorithmic Pipeline. (Stage labels legible in the
figure include "solve for quaternion" and "integrate".)

Fig. 2a shows a simple process, that of
calculating a new value of the direction cosine matrix
[S] from the rotation rate vector. The computation
follows three stages, and is thus divided into three
communicating processes, marked by boxes. Since the
computational requirements of the processes are roughly
comparable, each task may be placed on a separate
processor. The simplest approach to resolving
computational imbalance, which can sometimes be
applied, is to merge those processes requiring least
computation, and to sub-divide into a cascade those
requiring most (e.g. for a multistage integration scheme).


Fig. 2b. Use of Embedded Geometric Array in Algorithmic Pipe.
(Stage labels legible in the figure: "Solve for each element M",
"Total Moment", "Rot Rate".)


Updating the rotation rate vector is slightly
more complex (fig. 2b), particularly in a more exact
simulation where the variation in turbulent gust
velocity across the wing span is included in the model.
The simplest approach considered effectively divides the
wing span into parallel strips, and evaluates the
contribution of each strip to the total, together with an
overall correction. Since this computation takes by far
the longest time in this section of the simulator,
the pipeline may be branched, or an array may be used
to compute the elemental contributions. This provides an
effective solution, and is suitable for direct mapping as
a transputer array embedded in a linear pipe.

Processor  Farms 

The  processor  farm  uses  a  different  approach,  and 
may  be  of  the  data  farming  or  task  farming  type.  In  data 
farming,  identical  copies  of  a  program  ("workers")  are 
distributed  across  a  set  of  processors,  which  are  under 
the  control  of  a  master  program  ("farmer").  The  farmer 
sends  data  packets  to  the  workers,  and  collects  their 
results. (Task farming requires the master to
dynamically load tasks onto processors via the operating
system, and is not considered here.) A "router task" is
placed on each transputer node of the network, to pass
the work packets to the available processors. It is not
necessary  to  specify  the  number  of  processors  or 
configuration,  as  this  is  determined  at  load  time. 
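The data farming arrangement described above can be sketched in a few lines. This is only an illustration of the farmer/worker pattern using threads and queues on a single machine; it is not the 3L farm library, and the names `worker`, `farm` and the sum-of-squares workload are hypothetical.

```python
import queue
import threading

def worker(tasks, results):
    """Identical worker: take a data packet, process it, return the result."""
    while True:
        packet = tasks.get()
        if packet is None:              # sentinel: no more work
            break
        index, data = packet
        results.put((index, sum(x * x for x in data)))

def farm(packets, n_workers=4):
    """Farmer: hand out packets to whichever worker is free, collect results."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for i, p in enumerate(packets):
        tasks.put((i, p))
    for _ in threads:
        tasks.put(None)                 # one sentinel per worker
    out = [None] * len(packets)
    for _ in packets:
        i, r = results.get()            # results may arrive in any order
        out[i] = r
    for t in threads:
        t.join()
    return out

print(farm([[1, 2], [3], [4, 5, 6]], n_workers=2))
```

As in the farm described in the text, the number of workers is a run-time parameter and the farmer does not need to know which worker handles which packet.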

A problem with current versions of the transputer
is that messages routed from one node to another through
an intermediate transputer incur delays.
Consequently, as the routing of messages is generally
less efficient and the communications overhead larger
than in the case of a fixed geometric partitioning (where
the latter is an alternative), one often achieves
disappointingly poor parallel efficiencies. To
implement a general farm-based approach apparently
still necessitates programming in Occam; for
this work, however, the 3L Fortran library of farm handling
procedures was used. The results obtained in a number
of different applications indicated that, for this
implementation at least, high levels of parallel
efficiency could only be attained if the worker tasks
are computationally intensive.


Turbulence  Simulation  and 
Aircraft  Control. 


(Block diagram. Legible labels: "Random No. Generator", "Filter",
"Scale", "Gaussian Turbulence", "Non-Gaussian Turbulence".)
The generation of a single component of a
one-dimensional (i.e. no variation in y or z) turbulent gust
velocity field is considered first. Let d(x) denote the
vertical component of the disturbance velocity along the
x-axis (the direction of flight). It is required that both
the spatial frequency content and the probability
distribution function of d(x) should closely match those
measured in real turbulence. The method used (fig.
3) requires three independent uniform random number
sequences. (The procedure used for the random
sequence generation is the standard linear congruential
method, with an additional randomising shuffle table
[6].) Firstly, the random numbers are processed to
obtain a Gaussian distribution. An approximately
Gaussian distributed random sequence is most simply
obtained by forming the subsequence consisting of the
mean of n terms of the original uniformly distributed
sequence, where n is at least 12.
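A minimal sketch of this procedure is given below. The multiplier, increment, modulus and table size are common textbook choices and are assumptions, since the constants actually used are not stated; the rescaling of each mean to zero mean and unit variance is likewise added here for convenience. The class and function names are hypothetical.

```python
import statistics

class ShuffledLCG:
    """Linear congruential generator with a shuffle table."""
    # Multiplier, increment and modulus: textbook constants, assumed here.
    A, C, M = 1664525, 1013904223, 2**32

    def __init__(self, seed, table_size=32):
        self.state = seed % self.M
        self.table = [self._step() for _ in range(table_size)]
        self.last = self._step()

    def _step(self):
        self.state = (self.A * self.state + self.C) % self.M
        return self.state

    def uniform(self):
        """Return a uniform deviate in [0, 1)."""
        j = self.last * len(self.table) // self.M   # slot chosen by previous output
        self.last = self.table[j]
        self.table[j] = self._step()                # refill the slot just used
        return self.last / self.M

def gaussian_sequence(gen, count, n=12):
    """Each output is the mean of n uniform deviates, rescaled to unit variance."""
    scale = (12.0 * n) ** 0.5       # the mean of n U(0,1) values has variance 1/(12 n)
    return [(sum(gen.uniform() for _ in range(n)) / n - 0.5) * scale
            for _ in range(count)]

samples = gaussian_sequence(ShuffledLCG(seed=12345), 20000)
print(statistics.mean(samples), statistics.pstdev(samples))
```

Starting several such generators with different seeds gives the independent sequences required, which is what makes the method convenient for a data farm.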

The processed sequences serve as independent
white noise sources n1, n2, and n3, and each source
then passes through a linear filter. The output from any
of these filters has the correct spectral distribution of
frequencies, but the spatial distribution is too
homogeneous, as shown. To correctly represent the
non-Gaussian "patchy" nature of real turbulence, the
outputs of the linear filters are combined non-linearly
[7].


Fig. 3. Generation of Simulated Turbulence


The longitudinal and lateral turbulent velocity
components are generated in a similar fashion, with
slightly different linear filters. To account for the effect
of lateral variations in the turbulent velocity field, the
simplest approach is to generate an additional two
velocity components. The procedure is exactly as
before (though with appropriate filter characteristics).
These velocity components account for the integrated,
moment-inducing effect of instantaneous spanwise
asymmetry in the longitudinal and vertical turbulent
gust velocities [8]. Thus fifteen random number
generators (three per component) may be required to
provide the five turbulent velocity components required
for a two-dimensional (strictly quasi-two-dimensional)
turbulence field simulation.

The obvious technique which may be used to
apply the linear filters is the FFT. The spectral
distribution of real turbulence, however, may be
approximated sufficiently accurately for simulation by
a simple rational filter, such as:

$$H(\omega) = \frac{A}{1 + B\,j\omega}$$

with A, B constants, and $\omega$ the frequency variable.
The Z-transform may then be applied to derive a
recursive filter, such that the i-th turbulence velocity
$d_i$ is given in terms of previously filtered values
$d_{i-1}$, $d_{i-2}$, the current unfiltered input $w_i$, and
previous unfiltered inputs $w_{i-1}$, $w_{i-2}$ by
a relation such as:

$$d_i = k_1 d_{i-1} + k_2 d_{i-2} + k_3 w_i + k_4 w_{i-1} + k_5 w_{i-2}$$

where $k_1$, $k_2$, etc. are constants.
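The recursion above translates directly into a short loop. The sketch below assumes that values before the start of the sequence are zero, and the coefficients in the usage example are purely illustrative (they are not the constants derived for any turbulence spectrum).

```python
def recursive_filter(w, k1, k2, k3, k4, k5):
    """Apply d[i] = k1*d[i-1] + k2*d[i-2] + k3*w[i] + k4*w[i-1] + k5*w[i-2].

    Values before the start of the sequence are taken as zero (an assumption).
    """
    d = []
    d1 = d2 = 0.0      # d[i-1], d[i-2]
    w1 = w2 = 0.0      # w[i-1], w[i-2]
    for wi in w:
        di = k1 * d1 + k2 * d2 + k3 * wi + k4 * w1 + k5 * w2
        d.append(di)
        d2, d1 = d1, di
        w2, w1 = w1, wi
    return d

# Illustrative coefficients only (hypothetical, chosen to give a stable filter).
impulse = [1.0] + [0.0] * 5
response = recursive_filter(impulse, k1=0.5, k2=0.0, k3=1.0, k4=0.0, k5=0.0)
print(response)
```

Each output value costs five multiplications and four additions, independent of the length of the sequence.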

Clearly this provides a considerable
computational saving over the FFT. Furthermore, by relating
the position x to the time t, the turbulent velocity
components may be generated as a time series (i.e. at
each time step in the solution of the equations). This is
the approach normally used for the implementation of
one-dimensional and quasi-two-dimensional turbulence
generators. For an efficient transputer implementation,
however, it is not appropriate to attempt to generate each
value just when it is required, for the reasons considered next.

The generation of random numbers is well suited
to the data farm approach, as it is only necessary to start
each generator with different parameters and initial
seeds. The generator is formulated as a worker task,
under the supervision of a master task, and a copy of
the worker resides on each transputer including the root
node. The master task, which resides on the root
transputer, distributes the initial data to the generators,
and collects the results. The sequence of random values
produced by the generators may be routed back to the
master task in packets comprising either a single value,
or up to 256 values (in the current 3L release).

It is not efficient, however, for each worker to
send back one random value at a time, since the wasted
processor time required to initiate a communication
across a hardware link then becomes very significant in
relation to the time spent in computation. This is
illustrated in fig. 4, in which the performance of a
farm-based Gaussian random number generator is compared
for varying result packet lengths. Consequently each
worker generates a long array of values which are
transmitted back to the master as a single packet.
The workers continue asynchronously generating and
transmitting packets without further intervention.
Simple double buffering may be used in the output
from the master task, to emulate continuous generation,
if values are required on a pointwise basis.
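The dependence on packet length can be seen from a simple cost model in which every packet incurs a fixed link set-up time. The sketch below is illustrative only: the timings are hypothetical, in arbitrary units, and are not measurements from the transputer system.

```python
def link_efficiency(packet_len, t_value, t_setup, t_word=0.0):
    """Fraction of worker time spent computing when results are sent in packets.

    packet_len : number of values per result packet
    t_value    : time to compute one value
    t_setup    : fixed time to initiate one link communication
    t_word     : per-value transfer time charged to the processor
                 (0 if the transfer is fully overlapped with computation)
    """
    compute = packet_len * t_value
    overhead = t_setup + packet_len * t_word
    return compute / (compute + overhead)

# Hypothetical timings: the set-up costs as much as computing five values.
for n in (1, 16, 256):
    print(n, round(link_efficiency(n, t_value=1.0, t_setup=5.0), 3))
```

Under these assumed timings the efficiency rises from about 0.17 for single values to about 0.98 for 256-value packets, because the set-up cost is amortised over the whole packet.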

In the implementation of the turbulence
generator, it is possible to lump the filtering together
with the random number generation on the worker
tasks. This has the potential of ensuring that the
worker tasks are highly computationally intensive.
Since the filters differ for each velocity component,
however, a farm-based implementation of this form
would require non-identical workers. Although this
could be accomplished, it adds complexity, and is not
easily realisable for a general transputer network. In any
event, the computational overhead required for a simple
recursive filter is very low, and the filtering may be placed
entirely on the master task, without noticeably reducing the
high efficiencies achieved.

Implementing a full three-dimensional turbulence
field generator requires the use of the FFT for the
spatial filtering. Although it would be very useful to
combine a processor farm for the random noise
generation with a geometric array to accomplish the
spatial filtering in a hybrid parallel system, this option
is not effectively supported for PC-hosted systems in
the current 3L release. For a complete simulation
therefore, it is necessary to specify the entire network,
effectively as a geometric configuration. Load balancing
between the resources required by the random
generators and the filters is thus fixed, as a result of a
process of trial and error. It is difficult to obtain a
satisfactory (or even sometimes any) gain in
performance each time a single transputer is added to a
completely geometrically partitioned network. The
simplest procedure is clearly to add the transputers to
increase the performance of one part of the network (e.g.
the filter stage) at a time. Future work using the Meiko
should allow the implementation of hybrid strategies.

The results show, however, that it is possible to
obtain high parallel efficiencies in the generation of one
and quasi-two dimensional turbulence.

For aircraft control, simple PID controllers,
predictive, and linear optimal controllers are being
considered. Fig. 5 shows the application of an optimal
controller (applied to stabilise the longitudinal
dynamics only, i.e. velocity components u, w, pitch
angle θ, and rate q, and augmented by the integral of
pitch). For the linearised state variable equations,

$$\dot{x} = [A]\,x + [B_u]\,u + [B_d]\,d$$

the control vector u is obtained from

$$u = -[R]^{-1} [B_u]^T [P]\,x$$

where P is the solution of the steady-state Riccati
equation,

$$[P][A] + [A]^T [P] - [P][B_u][R]^{-1}[B_u]^T [P] + [Q] = 0$$

and [Q] and [R] are the weighting matrices associated
with the state-space vector and control input
respectively.

The solution of the above matrix equation is
accomplished by converting to an eigenvalue problem,
forming a matrix of eigenvectors, and performing an
inversion.

The parallelisation procedures which have been
considered to date to obtain the control vector u
comprise a coarse algorithmic parallelisation of the
major stages of the solution algorithm. As the
techniques are as described above this will not be
discussed further. The potential efficiencies attainable
are high for a coarse discretisation. Currently
approaches for a fine discretisation are being considered.


Discussion and Conclusions

Both algorithmic pipeline and processor farm
approaches have been applied to the parallel
decomposition of an aircraft flight simulation and
control system. Currently the techniques employed for
load balancing are very much of a "hands-on" nature, for
the geometric and algorithmic parallelisation strategies.
To allow more efficient general parallelisation, the
problem of through communication delays needs to be
resolved on the hardware side, whilst the ability to mix
paradigms would be very useful. It is anticipated that
the next generation of transputers will solve the routing
problem, and also facilitate effective general
parallelisation, rather than adherence to a rigid paradigm.
It is hoped that this will allow the obvious benefits of the
transputer to be more generally employed for this
application area.

Abstract

Combining techniques in an urban mobile
communications channel are simulated using a
modified Hashemi model. The parallel
implementation of this simulation on a distributed
memory, MIMD parallel architecture is
discussed. Combining techniques suitable for the
first generation digital cellular communications
system in North America are analyzed.
Scalability and overall performance results on the
Myrias SPS-2 are presented.

Introduction 

In mobile radio, energy can travel from the
transmitter to the receiver via more than one
path. This “multipath” situation arises
because of reflection and scattering from
buildings, trees and other obstacles along the
path. At the receiver, the radio waves combine
vectorially to give a resultant signal
which can be small. When this occurs, the
signal is said to be subject to fading. Moreover,
whenever relative motion exists, there is
a Doppler shift in the received signal. The
combined effect of fading and Doppler produces
a received signal with an amplitude
and phase that change quite substantially
with time.

This poster session describes the simulation of
the first generation digital cellular communications
system in North America [1], together
with a suitable technique for carrying out the
simulation on the Myrias SPS-2, a parallel
processing architecture. Bit error rate curves
resulting from the simulation are presented,
along with scalability results and the overall
performance of the simulation on the SPS-2.
Conclusions derived from results obtained
using this simulation, as well as the suitability
of running this simulation on the SPS-2,
are discussed.


The  Digital  Radio  Simulator 

The combined effect of Doppler and fading is
simulated using a modified Hashemi model [2],
[3]. A block diagram of the channel simulator
is given in Figure 1. The transmitter consists
of a precoder and a transmitting filter
corresponding to the first generation digital
cellular communications system in North
America [1]. The precoder maps the information
sequence {a} into π/4-shifted Quadrature
Phase Shift Keying (QPSK) and the
transmitting filter corresponds to a square
root spectral raised cosine filter with a roll-off
factor of 0.25.

The channel in Figure 1 corresponds to one
that would be used in a sparse high rise
urban environment at 800 MHz carrier frequency,
with the vehicle traveling at
50 km/hr. The delay spread is limited to
7 μsec, which corresponds to intersymbol
interference over only one adjacent symbol.
This yields a symbol rate of 24 ksymbols/sec.
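The channel parameters quoted above can be put in perspective with a little arithmetic. The sketch below uses the standard relation for the maximum Doppler shift, f_d = v f_c / c, which is not stated in the text and is introduced here as an assumption about how the simulated Doppler is characterised; the variable names are hypothetical.

```python
C_LIGHT = 299_792_458.0          # speed of light, m/s

carrier_hz = 800e6               # carrier frequency from the text
speed_ms = 50.0 * 1000 / 3600    # 50 km/hr expressed in m/s
symbol_rate = 24e3               # symbols per second
delay_spread_s = 7e-6            # maximum delay spread

max_doppler_hz = speed_ms * carrier_hz / C_LIGHT    # f_d = v * f_c / c
symbol_period_s = 1.0 / symbol_rate
spread_in_symbols = delay_spread_s / symbol_period_s

print(round(max_doppler_hz, 2), "Hz maximum Doppler shift")
print(round(symbol_period_s * 1e6, 2), "microseconds per symbol")
print(round(spread_in_symbols, 3), "symbol periods of delay spread")
```

The delay spread is a small fraction of the symbol period of roughly 41.7 μsec, which is consistent with interference reaching only one adjacent symbol.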

The receiver can accommodate a receiving
filter identical to the transmitting filter, an
adaptive equalizer, and/or a Viterbi algorithm
(VA). The equalizer can be either a
decision feedback equalizer (DFE) or a
simple linear equalizer (LE). The adaptation
algorithm associated with it can be
either a least mean square (LMS) or a recursive
least square (RLS) algorithm.

Parallel  Implementation 

Digital radio channel simulations are done on
the Myrias SPS-2. The SPS-2 utilizes a distributed
memory, MIMD parallel processing
architecture. A parallel implementation of
the simulation is done in the following
manner.

The first level of parallelism implemented is
to analyze different system parameters/
operating conditions in parallel. For each set
of these operating conditions, the simulation
is done over several city blocks. Calculations
within a given city block are independent of
all other locations, and hence are done in parallel.
Several signal to noise ratios (SNR) are simulated
within the channel in each city block.
Calculations for each of the SNRs are also
done in parallel. Thus, 3 levels of parallelism
are identified and implemented.
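Because the three levels are mutually independent, they can be flattened into a single list of tasks. The sketch below illustrates this with a thread pool; it is not the SPS-2 implementation, the `simulate` function is a hypothetical placeholder, and the SNR values are invented. The counts of 6 city blocks and 9 SNRs are those quoted in the performance results of this paper.

```python
from itertools import product
from concurrent.futures import ThreadPoolExecutor

def simulate(condition, block, snr_db):
    """Placeholder for one independent channel simulation (hypothetical)."""
    return (condition, block, snr_db)

conditions = ["DFE, no receiving filter", "DFE, with receiving filter"]
city_blocks = range(6)
snrs_db = range(0, 27, 3)        # nine hypothetical SNR values

# Level 1 x level 2 x level 3: every combination is an independent task.
tasks = list(product(conditions, city_blocks, snrs_db))
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda t: simulate(*t), tasks))
print(len(tasks))
```

With these counts one set of operating conditions yields 54 independent tasks and two sets yield 108, which may explain the processor counts chosen for the scalability runs; the text does not state this explicitly.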

Bit  error  rates  predicted  using  the  simulation 
are  discussed  in  the  next  section. 

Simulation  Results 

Using the operating conditions associated
with Figure 1, simulation results were
obtained. Figure 2 displays 5 bit error rate
curves versus E_b/N_0, where E_b is the average
bit energy and N_0/2 is the magnitude of the
two-sided power spectral density of the additive
white Gaussian noise. Curves A and B
correspond to a receiver with DFE, curves D
and E to a receiver with LE and curve C
to a receiver with no equalization. Curves A and D
correspond to a receiver with no receiving
filter, and curves B, C, and E to a receiver
with a receiving filter. Run times for the
simulation and scalability results, using the
Myrias SPS-2, are presented in the next section.


Figure  1  -  Mobile  radio  simulation 


Performance  Results 

The channel is simulated using 9 to 378 processors
on the SPS-2. Absolute performance
results are measured using 9 to 108 processors.
Scalability results are measured using
54 to 378 processors. It should be noted that
no changes to the code or the executable are
required to run the program using different
numbers of processors.

Figure 3 shows absolute performance results
for two sets of operating conditions. The first
set of operating conditions includes a DFE
with no receiving filter. The second set contains
both a DFE and a receiving filter. Each
channel is simulated over 6 city blocks, and 9
SNRs are evaluated. Using 9 processors, a
run time of 13 hours and 27 minutes was
measured. The same simulations were done
using 18, 54, and 108 processors.

The run time decreases by a factor of 11.5
when the number of processors is increased by
a factor of 12, from 9 to 108.
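These figures correspond to a parallel efficiency of about 96% relative to the 9-processor run. The short calculation below makes the arithmetic explicit; the implied run time on 108 processors is derived from the quoted speedup and is not a figure reported in the text.

```python
def speedup_efficiency(speedup, base_procs, procs):
    """Parallel efficiency relative to the run on base_procs processors."""
    return speedup / (procs / base_procs)

base_minutes = 13 * 60 + 27            # 13 h 27 min on 9 processors
efficiency = speedup_efficiency(11.5, 9, 108)
minutes_on_108 = base_minutes / 11.5   # run time implied by the speedup

print(round(efficiency, 4), round(minutes_on_108, 2))
```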

Scalability results are presented in Figure 4,
using 54 to 378 processors on the SPS-2. Scalability
is defined here to be constant work
(computation) per processor. One set of
operating conditions is simulated using 54
processors, and 2 sets on 108 processors.
Results are also presented for simulations performed
using 216, 324, and 378 processors.
Run times (elapsed time), non-dimensionalized
using a run time of 1 hour and 9 minutes on
54 processors, vary from 1.01 on 108 processors
to 1.10 on 378 processors. Using 54 processors,
96% of the total run time is actual
user (CPU) time. 1% is used by the operating
system. Idle processors account for the
remaining 3% of the elapsed time.


Some  processors  are  idle  during  the  following 
parts  of  the  simulation.  Initially,  only  one 
processor  is  used  when  the  system  parameters 
for  each  datafile  are  being  set  up.  Processors 
are  also  idle  while  the  simulation  over  each 
city  block  is  being  established. 

Using  378  processors,  idle  time  accounts  for 
12%  of  the  run  (elapsed)  time.  User  time 
(CPU)  is  87%.  The  remaining  1%  is  used  by 
the  operating  system. 


Figure 2. The bit error rate for urban land
(horizontal axis: E_b/N_0). Curves:
A - DFE, no receiving filter
B - DFE, with receiving filter
C - no equalization, with receiving filter
D - LE, no receiving filter
E - LE, with receiving filter

Figure 3. Scalability - fixed

Conclusions

From Figure 2, it is seen that DFE offers a
poorer  performance  than  LE,  and  should 
therefore  be  avoided.  It  is  also  seen  that  LE 
offers  a  slight  improvement  in  performance 
over  having  only  a  receiving  filter.  However, 
the  improvement  is  very  small  at  bit  error 
rates  of  10^  For  that  reason,  it  is  believed 
that  equalization  is  not  needed. 

The VA is not displayed in Figure 1, since it
was found that with a delay spread limited to
7 μsec, the channel is essentially flat. In
other words, there is no need for a VA for
such a channel. Moreover, it is found that
with a vehicle travelling at 50 km/hr, the
channel is fast fading, and the LMS algorithm
fails to track the rapid amplitude and
phase variations of the channel.

The Myrias SPS-2 is a very suitable computer
platform for doing this type of simulation.
An existing code, written in Fortran 77, was
ported to the SPS-2 with very little effort.

Testing and evaluation of various techniques
used in the channel simulation program are
done rapidly on the SPS-2. Use of the simulation
on the SPS-2 enables rapid evaluation
of many configurations before the final real-world
experiments are conducted.

Abstract

A multi-player 3D Asteroids video game designed
to be used as a testbed for evaluating controller
algorithms was described in [1]. The original
version of the game and a separate interactive
3D graphics interface for a human player were
implemented, based on CrOS III and VERTEX, on
an NCUBE-1 hypercube equipped with a parallel
Real-Time Graphics board. The Asteroids and
interactive graphics interface programs are examples
of parallel programs which communicate with
each other in a space-shared multi-processor
environment.

We have successfully ported the Asteroids and
the interactive graphics interface to run on NCUBE
using ParaSoft EXPRESS. The new versions of these
programs were further ported to run on a SUN
386i with an add-on Transputer board. We present
general design considerations that enable easy
migration of communicating parallel programs to any
other hardware platform that runs EXPRESS. We
also report specific experience of porting Asteroids
and an associated interactive player interface
program from an NCUBE hypercube to a SUN 386i
Transputer-based system, with no modification of
codes.

Introduction 

Code portability is a major concern for people
who write programs, and especially so for those
who implement computation intensive algorithms.
Scientists would like to run their specialized codes,
without modification, on faster computers whenever
they are available. Ample examples can be
found in the fields of computational fluid dynamics,
chemical dynamics, and quantum chromodynamics,
just to name a few.


There is another class of computation intensive
programs which compete or cooperate with one
another within a simulated organizational structure.
Usually, these are programs which implement
artificial intelligence, decision-making algorithms.
An example of a simulated organizational framework
is a game environment with a game manager
program which coordinates the actions and competitions
of multiple player programs via message-passing.
The “players” and the game manager can
benefit from parallelization. However, it is difficult
to develop portable codes for these communicating
parallel programs, which require not only inter-processor
communication within each program but
also communication among different programs.

There is a proliferation of small parallel computer
systems for tutorial and experimental purposes.
Among these, the Transputer-based system
is a popular one. We present general system design
guidelines which enable easy porting of the game
environment to other hardware platforms that are
supported by EXPRESS, and discuss specific experience
in the porting of the NCUBE Asteroids to a
SUN 386i Transputer-based system.

Why  A  Game? 

There has been enormous interest in research
on intelligent controller algorithms which
can perform tasks that normally require human
supervision for decision making. Some examples of
such tasks are navigation control and multi-target
tracking [2, 3]. Intelligent algorithms are in general
very computation intensive. Moreover, there are
no effective ways to evaluate or compare the performance
of these algorithms, either running alone
or simultaneously.

A dynamic game which contains the features
of randomness, secrecy, incomplete and noisy information,
as well as limited resources of the players
would provide a natural arena for these algorithms.
Such a game generates a consistent, dynamically
evolving environment for the participating player
programs which are implementations of various
algorithms for some simple, well-defined objectives.
It is also essential that such a game be implemented
on a powerful computing environment so that computation
intensive algorithms can compete fairly in
real-time.

Asteroids 

The Asteroids arcade game is a single player
game which features a spacecraft traversing a 2D
toroidal space with inert, moving celestial bodies of
various sizes. Given an interactive graphics display
and button-controlled interface, a human player can
maneuver a spacecraft to turn, thrust, yank, or to
fire missiles. The objective of the game is very simple:
to destroy as many asteroids as possible
without being hit by them. Large asteroids split
into multiple smaller ones when hit by other asteroids,
missiles or spacecraft. A spacecraft is destroyed
when hit by any object. Since the Asteroids
game is conceptually simple, we have chosen to
implement it, with some enhancement, as a testbed
for the evaluation of intelligent algorithms which
are developed specifically to achieve the game
objectives.

We have implemented a 3D Asteroids game
environment on an NCUBE hypercube which was
equipped with a parallel graphics board [1]. The
software system was based on CrOS III and VERTEX.
The implementation of Asteroids on a space-shared
concurrent processor makes it easy to compare
the performance of different algorithms that are
assigned to a common task at the same time.
Preferably, an intelligent algorithm in use is parallelized
to take advantage of the multi-processor
architecture for efficiency. Otherwise, it is trivial
to modify a sequential program so that it will run
on one node of a concurrent processor, and still be
able to take part in the game.

The enhanced Asteroids game models spacecraft
and asteroids, governed by physical laws,
traversing a 3D toroidal space. Unlike the arcade
game, spacecraft are not destroyed immediately
when they collide with other flying objects. They only
lose ‘energy’, which is used as an index of cost. If
a player’s spacecraft is out of energy, that player is
out of the game.
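The two rules just described, motion in a toroidal space and energy loss on collision, can be sketched as follows. This is an illustration only and not the game driver's code: the box size, the energy values, and the names `Body`, `advance` and `collide` are all hypothetical.

```python
from dataclasses import dataclass

BOX = 100.0   # hypothetical edge length of the cubic toroidal space

@dataclass
class Body:
    pos: tuple
    vel: tuple
    energy: float = 10.0

def advance(body, dt):
    """Move a body and wrap each coordinate back into [0, BOX)."""
    body.pos = tuple((p + v * dt) % BOX for p, v in zip(body.pos, body.vel))

def collide(craft, cost=1.0):
    """A collision costs energy; returns True while the player is still in the game."""
    craft.energy -= cost
    return craft.energy > 0

ship = Body(pos=(95.0, 2.0, 50.0), vel=(10.0, -5.0, 0.0))
advance(ship, dt=1.0)
print(ship.pos)      # the ship leaves through one face and re-enters through the opposite one
```

An object that crosses any face of the box re-enters from the opposite face, so the space has no boundary and every player sees statistically similar surroundings.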


3D Asteroids is designed to accommodate
multiple ‘players’. The players do not all have to join
the game at the same time. At any time when the
game is running, the game program is capable of
adding new or removing existing ‘players’. This
arrangement allows a real-time competition among
the different ‘players’, who are subjected to the same
global conditions and game rules, but occupy
different locations in the 3D toroidal space.

Overall  Design  of  Asteroids 

One way to look at the Asteroids game system
is to treat the game objectives as the objective
functions of an optimization problem which
is constrained by the imposed game rules. The
user-supplied algorithms, including the interactive
player’s intelligence, implement different approaches
to solve the posed problem. Therefore, it is essential
that the game can support multiple players for
the purpose of direct comparison of several algorithms.
This also makes the game more realistic
and exciting.

None of the programs involved in Asteroids makes any
assumptions about the underlying hardware environment.
The programs are classified by functionality into three
categories: the ‘game driver’, ‘player’, and ‘graphics
driver’ programs.

The ‘game driver’ is the core of the game. Only one copy
of the game driver is needed at any time. The primary
entities in the game driver are objects such as spacecraft,
missiles, and asteroids. It implements the rules of the
game, processes player requests, and evolves game objects
in time.

There are two types of ‘player’ programs. An interactive
player program implements a 3D graphics interface for a
human player to control a spacecraft, while a batch player
program implements an intelligent algorithm that takes
over the responsibilities of a human player during the
game. A player program is isolated from the rest of the
game so that any modification of it affects the performance
of an individual player only, and has no effect on the
operation of the game itself.

A ‘graphics driver’ is an interface between player
programs and the graphics hardware. It provides the
low-level graphics operations for player programs and
isolates them from the ever-changing graphics hardware.
An interactive player program certainly needs graphics
support because a human player relies on the visual display
to make decisions. A batch player program has the option
of using a graphics display for the convenience of the
observers of the game.

Hardware  Considerations 

The first version of Asteroids was developed for an NCUBE
hypercube with a Real-Time Parallel Graphics Board, which
has 16 NCUBE processors and uses Hitachi HD63484 Advanced
CRT Controllers (ACRTC). For convenience, the processors on
the graphics board will be called graphics nodes, and those
on the NCUBE hypercube will be called array nodes.

Figure 1 is a block diagram of an NCUBE with a parallel
graphics system. The control processor of the entire system
is an Intel 80286. Two distinctive features of the graphics
system are that the 16 graphics nodes can communicate with
each other, or with the array nodes, using high-speed I/O
channels, and that signals from the graphics tablet can
bypass the control processor and reach the graphics board
directly via an RS-232 port at 19200 baud.


Figure 1: A Block Diagram for an
NCUBE-1 with Real-Time Graphics

A graphics node can issue a graphics command, by sending
a message to the 80186 on the graphics board, to initiate a
DMA transfer of pixel data from the local memory of the
graphics node to the frame buffer of the display monitor.
The local memory of the graphics nodes is mapped to the
frame buffer in alternating 2-pixel-wide strips. A DMA
transfer takes 1/30 second. However, altering the data in
the frame buffer while a DMA is in progress usually
produces an unpleasant amount of flicker.


Figure 2: A Block Diagram for a SUN
386i with an add-on Transputer Board

Of the 128K local memory available in the graphics nodes,
about 20K is used by GRAPHOS (a nucleus similar to
VERTEX). A single buffer for each graphics node is 48K. If
2 consecutive 1/16 frames of display are to be computed by
each of the 16 graphics nodes before calling a DMA
transfer, the executable graphics processing program on
the graphics nodes has to be smaller than about 8K. It is
unreasonable to expect any realistic graphics program to
occupy only 8K of memory. Therefore, it is very difficult
to make use of the 2-Mbyte frame buffer for real-time
double buffering.

A block diagram for a SUN 386i is shown in Fig. 2. It is a
much simpler configuration because it does not have a
parallel graphics system. The SUN 386i acts as the control
processor for an add-on multi-processor Transputer board.
All input devices are connected to the SUN 386i. There is
no direct I/O channel from the Transputer board to the
frame buffer, which communicates with the 386i only. This
hardware configuration will not support real-time
animation. However, the speed of the graphics display can
always be improved by adding specialized graphics hardware
later.

System  Software  Considerations 

There is no established standard for the new parallel
computer languages, programming methodologies, and
operating systems. We have chosen to implement the new
version of Asteroids on an NCUBE and a SUN 386i Transputer
system using ParaSoft EXPRESS. The reasons behind this
choice are that EXPRESS is portable, simple, efficient, and
CrOS compatible. Any carefully written EXPRESS application
can be migrated separately from one hardware platform to
another relatively easily, as long as the computer system
runs EXPRESS.

Implementation  Guidelines 

Portable and efficient intra-program communication is
easy because it is provided by EXPRESS functions which are
already available for a wide range of parallel computers.
However, porting a set of parallel programs which
space-share a concurrent processor and communicate with
one another is not as straightforward. Some operating
systems, like VERTEX on the NCUBE, do not allow a parallel
program to send messages outside its own allocated
sub-cube to another sub-cube within the main array. Also,
there is hardware-dependent code, such as that for graphics
display. Efficient graphics code is hardly portable because
it involves too much hardware-specific programming.

In order to make Asteroids portable, i.e., to run the same
game driver and its associated player programs unchanged
on different hardware platforms, an extra layer of software
containing two modules is introduced. These two modules,
INTERCOM and POLYCOM, are small user-level libraries which
provide player programs with the capability to communicate
with the game and graphics drivers, respectively. We have
implemented INTERCOM and POLYCOM on an NCUBE-1 with a
Real-Time Graphics board, and on a SUN 386i with a
Transputer board.

Implementation  of  INTERCOM 

The migration of Asteroids from CrOS to EXPRESS is
straightforward and does not deserve further discussion.
We start the discussion with the implementation of
INTERCOM.

Common to most distributed-memory concurrent computers is
a control processor (CP) which usually runs a version of
Unix or a Unix-like operating system, such as SUN-OS on a
SUN 386i and AXIS on an NCUBE. These operating systems
support multi-tasking on the CP. A simple approach to
implementing INTERCOM is to make use of Unix-style pipes.
Even though AXIS does not provide system support for pipe
communication on the CP, it is not difficult to implement
such a mechanism. Using pipes, the game driver and the
player programs which run on the same space-shared
parallel computer can communicate with one another on the
CP. However, this method is very inefficient and is not
suitable for real-time simulations, especially when the CP
has to perform many other tasks besides handling the game
processes. It is preferable for inter-program
communication to take place within the parallel computer
or via special high-speed I/O channels.

On an NCUBE-1 with a Real-Time Graphics board,
inter-program communication can take place via the
graphics board, which has 16 high-speed I/O channels to the
main array. Since VERTEX only checks the destination of
messages that originate from a processor in the main
array, we made use of the graphics board to handle message
routing to different sub-cubes of the NCUBE hypercube.
When a player program (in one sub-cube) sends a message to
the game driver (in another sub-cube), the message is
actually routed through the graphics I/O board. The
high-level INTERCOM library provides this service
transparently with the help of a set of message forwarding
routines in the FWDLIB library, which has to be downloaded
to the 16 graphics nodes before the game driver and the
player programs are loaded onto distinct sub-cubes in the
main array.

Since EXPRESS does not check the destination of a message,
and it is the native operating system running on each
Transputer processor, inter-program communication between
player programs and the game driver can take place
entirely within the Transputer network. For portability,
an equivalent INTERCOM library is written on top of
EXPRESS for a SUN 386i Transputer-based system. In this
case, no FWDLIB library is needed.

The INTERCOM library for the game is very simple. There
are only four routines available. At the beginning of a
player program, a call to play_init() registers the player
with the game driver. The game driver is then able to find
out the number of processors a player occupies, and assigns
a player number. When a player makes a call to
read_state(), a new update of the environment is returned.
All nodes of a player receive the same message from the
game driver. If a player wants to send a move to the game
driver, it makes a call to send_moves(). For a player
program which expects to receive input from the keyboard,
a call to get_keys() fills a designated buffer with all the
keystrokes received so far and returns the number of
keystrokes placed in the buffer.

The FWDLIB library is implemented for the NCUBE-1 with
parallel graphics only. It provides communication between
arbitrary nodes in the main array, regardless of whether
they are in the same allocation group or not. The library
maintains 16 communication channels, each of which stores
the addresses of two sub-cubes in the main array as well as
the address of a particular node in each sub-cube as a
receiver. If an array node in one of the two sub-cubes
sends a message through the library's send routine, it is
delivered to the receiver in the other sub-cube, where it
can be read with a call to get_msg(). An array node can
identify itself as a receiver, and its allocation group as
the sub-cube, by calling attach_to_channel(). To detach
both communicating sub-cubes from a specific communication
channel, the parallel programs running in the two
sub-cubes have to call clear_channel().
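
The channel bookkeeping described above can be pictured
with a small sketch. It is written in Python for brevity
and is not the FWDLIB source: the class and method names
(ForwardingTable, attach, send, get, clear), the return
values, and the in-memory queues are all assumptions made
for illustration.

```python
from collections import deque


class Channel:
    """One forwarding channel joining at most two sub-cubes (hypothetical model)."""

    def __init__(self):
        self.receivers = {}   # sub-cube address -> receiver node in that sub-cube
        self.queues = {}      # sub-cube address -> messages waiting for its receiver
        self.cleared = set()  # sub-cubes that have asked to detach


class ForwardingTable:
    """Toy stand-in for the channel table kept on the graphics nodes."""

    def __init__(self, n_channels=16):
        self.channels = [Channel() for _ in range(n_channels)]

    def attach(self, chan, subcube, node):
        """Register `node` as the receiver for `subcube` on channel `chan`."""
        ch = self.channels[chan]
        if subcube not in ch.receivers and len(ch.receivers) == 2:
            raise ValueError("channel already joins two sub-cubes")
        ch.receivers[subcube] = node
        ch.queues.setdefault(subcube, deque())

    def send(self, chan, from_subcube, payload):
        """Queue `payload` for the receiver in the other sub-cube.

        Returns the receiver's node address, or None if the channel does
        not currently join `from_subcube` to a second sub-cube.
        """
        ch = self.channels[chan]
        others = [s for s in ch.receivers if s != from_subcube]
        if from_subcube not in ch.receivers or len(others) != 1:
            return None
        ch.queues[others[0]].append(payload)
        return ch.receivers[others[0]]

    def get(self, chan, subcube):
        """Pop the oldest message waiting for `subcube`, or None if there is none."""
        queue = self.channels[chan].queues.get(subcube)
        return queue.popleft() if queue else None

    def clear(self, chan, subcube):
        """Record a detach request; free the channel once every attached side has asked."""
        ch = self.channels[chan]
        if subcube in ch.receivers:
            ch.cleared.add(subcube)
        if ch.receivers and ch.cleared == set(ch.receivers):
            self.channels[chan] = Channel()
            return True
        return False
```

In this model a message sent from one attached sub-cube
always lands in the queue of the other, and a channel is
released only after both sides have asked to detach, which
mirrors the requirement that both programs make the
clearing call.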

The INTERCOM library on the NCUBE makes use of the FWDLIB
library implicitly. A player program using INTERCOM can
communicate with the game driver without calling, or
needing to know about, any of the FWDLIB routines.

Implementation  of  POLYCOM 

There is a significant difference between the NCUBE-1 and
SUN 386i graphics hardware, as can be seen in Figs. 1 and
2. A user of the Asteroids system whose main concern is to
develop intelligent algorithms to play the game would not
want to spend too much time experimenting with different
graphics display strategies, let alone dealing directly
with the graphics hardware at a very low level. We have
developed a parallel polygon graphics driver for the NCUBE
Real-Time Graphics board and an equivalent Sunview-based
graphics driver for the SUN 386i.

On an NCUBE, player programs run on distinct sub-cubes in
the main array, while the parallel polygon graphics driver
runs on the 16 graphics nodes on the Real-Time Graphics
board. On a SUN 386i system with no specialized graphics
hardware, player programs run on the Transputer nodes, and
the graphics driver runs on the CP, i.e., the SUN 386i.
Since a graphics driver and player programs run on
different processors with no shared memory, the player
programs have to send drawing commands via messages.

While  the  graphics  drivers  hide  all  hardware 
details  and  provide  3D  polygon  drawing  capabilities 
for  the  players,  POLYCOM  is  a  small  library  which 
furnishes  a  consistent  set  of  user-level  routines  for 
player  programs  to  communicate  with  the  graphics 
driver.  Player  programs  using  POLYCOM  can  send 
drawing  instructions  to  the  graphics  driver  without 
knowing  where  it  is. 

POLYCOM supports drawing points, lines, and filled
polygons. Simple functions like point(), pointset(),
line(), and polyline() are available. It can draw
background stars with star() or starset() for any space
game. It also supports polygon drawing through the function
calls poly() and polyset(), which draw a collection of one
or more filled or wire-frame polygons. Fundamental graphics
routines are also provided by POLYCOM: ginit() for
initializing the graphics library, resetting the clipping
boundaries, and clearing the screen; setclip() for setting
the clipping boundaries; setcolor() for changing the RGB
values of a palette entry; dma() for making drawings
visible by sending images to the frame buffer; and gcmd()
and gcmd_nodma() for executing the accumulated drawing
commands with or without an automatic call to dma().
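
The idea of accumulating drawing commands on the player
side and shipping them to the graphics driver as messages
can be sketched as follows. This is a Python illustration
under stated assumptions, not the POLYCOM interface: the
method names mirror a few of the POLYCOM routines, but the
PolycomClient class, the argument lists, and the message
tuples are hypothetical.

```python
class PolycomClient:
    """Accumulates drawing commands and ships them to a graphics driver."""

    def __init__(self, send_to_driver):
        self._send = send_to_driver   # callable that delivers one message
        self._pending = []            # drawing commands not yet executed

    def line(self, p0, p1, color):
        """Record a line segment between two 3D points."""
        self._pending.append(("line", p0, p1, color))

    def poly(self, vertices, color, filled=True):
        """Record one filled or wire-frame polygon."""
        self._pending.append(("poly", tuple(vertices), color, filled))

    def dma(self):
        """Ask the driver to copy the finished image to the frame buffer."""
        self._send(("dma",))

    def gcmd_nodma(self):
        """Send the accumulated commands for execution without displaying them."""
        self._send(("draw", tuple(self._pending)))
        self._pending.clear()

    def gcmd(self):
        """Execute the accumulated commands, then make the result visible."""
        self.gcmd_nodma()
        self.dma()
```

Because the client only hands messages to a callable, the
same player code could in principle be paired with a driver
on the graphics nodes or on the CP; that routing is left
outside the sketch.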

Asteroids on NCUBE and Transputers

The Asteroids game was implemented both on an NCUBE with
parallel graphics and on a SUN 386i with a Transputer
board. It uses EXPRESS, INTERCOM, and POLYCOM for intra-
and inter-program communication. The overall relationships
of the three categories of programs and the communication
among them are illustrated in Figs. 3 and 4.

An oval represents a process, and rectangular boxes
differentiate the three types of programs in Figs. 3 and
4. The top level of a box indicates the type of program,
while the lower levels show the libraries used by the
program. EXPRESS is not included in the boxes because we
assume that it is available and is used by programs that
require intra-program communication. Bi-directional arrows
in the figures indicate links for inter-program
communication, while uni-directional arrows show the
parent-child relationship of processes. Although a batch
player program appears in both Fig. 3 and Fig. 4, it has
not been developed yet. The two figures simply assume that
competing player programs exist.

Figure 3: Schematic of program relationship
in NCUBE Asteroids.

Figure 4: Schematic of program relationship
in SUN 386i Asteroids.

Conclusions

We have presented guidelines for porting communicating
programs, either parallel or sequential, which space-share
a distributed memory concurrent processor environment.
Specifically, we discussed porting a version of NCUBE
Asteroids and an associated interactive graphics interface
for a human player to a SUN 386i with a multi-processor
Transputer board. We showed that, following the general
guidelines, both the Asteroids game driver and the
associated interactive player programs can be migrated
from an NCUBE-1 with a Real-Time Graphics board to a SUN
386i Transputer-based system with absolutely no change of
code.

ABSTRACT

We  describe  a  general  framework  for  building  and  running 
complex  time-driven  simulations  with  several  levels  of 
concurrency.  The  framework  has  been  implemented  on  the 
Caltech/JPL Mark IIIfp hypercube using the Centaur
communications protocol. Our framework allows the
programmer to break the hypercube up into one or more
subcubes  of  arbitrary  size  (task  parallelism).  Each  subcube 
runs  a  separate  application  using  data  parallelism  and 
synchronous  communications  internal  to  the  subcube. 
Communications  between  subcubes  are  performed  with 
asynchronous  messages.  Subcubes  can  each  define  their  own 
parameters  and  commands  which  drive  their  particular 
application.  These  are  collected  and  organized  by  the  Control 
Processor  (CP)  in  order  that  the  entire  simulation  can  be 
driven  from  a  single  command-driven  shell.  This  system 
allows  several  programmers  to  develop  disjoint  pieces  of  a 
large  simulation  in  parallel  and  to  then  integrate  them  with 
little  effort.  Each  programmer  is,  of  course,  also  able  to  take 
advantage  of  the  separate  data  and  I/O  processors  on  each 
hypercube  node  in  order  to  overlap  calculation  and 
communication  (on-board  parallelism)  as  well  as  the 
pipelined  floating  point  processor  on  each  node  (pipelined 
processor  parallelism). 

We  show,  as  an  example  of  the  framework,  a  large  space 
defense  simulation.  Functions  (sensing,  tracking,  etc.)  each 
comprise  a  subcube;  functions  are  collected  into  defense 
platforms  (satellites);  and  many  platforms  comprise  the 
defense  architecture.  Software  in  the  CP  uses  simple  input  to 
determine  the  node  allocation  to  each  function  based  on  the 
desired  defense  architecture  and  number  of  platforms  simulated 
in  the  hypercube.  This  allows  many  different  architectures  to 
be  simulated.  The  set  of  simulated  platforms,  the  results,  and 
the  messages  between  them  are  shown  on  color  graphics 
displays.  The  methods  used  herein  can  be  generalized  to  other 
simulations  of  a  similar  nature  in  a  straightforward  manner. 

I.  INTRODUCTION 

Many  applications  in  scientific  computing  cannot  be 
solved  with  the  homogeneous  approach  traditionally  used 
with  hypercube  multicomputers.  Solutions  to 
inhomogeneous  problems  are  required  by  such  applications  as 
electronic  circuit  simulations,  war  games,  simulations  of 
spacecraft  systems,  simulations  of  national  or  world 
economies, etc. Often such applications involve a degree of
time-dependence. That is, the character of the solution
evolves with time. We call such applications
inhomogeneous time-driven simulations, and they are
characterized by the following features: 1) They are
composed of TASKS with various degrees of workload. 2)
Tasks communicate with one another to perform the
simulation. 3) Each task has a COMPUTATION
CYCLE which is repeated many times during the
simulation. 4) Each cycle has four phases: a) reception of
data from other tasks, b) processing of that data, c)
communication of results to other tasks, d) advancing
simulation time t. 5) Different tasks may take different
amounts of simulation time to perform their computation
cycles, as well as taking different amounts of real time.

We  have  developed  a  general  simulation  framework  for 
building  and  running  such  inhomogeneous  time-driven 
simulations  on  hypercubes.  The  goals  of  our  framework  are 
to:

1. run tasks in parallel for maximum speed-up;

2. load balance the processing power of the
hypercube nodes so CPU-intensive tasks receive
more CPU cycles than simple tasks;

3. keep tasks distinct so they can be added, deleted,
or replaced at will -- even at run time (however,
we do not support the addition, deletion, or
migration of tasks during the simulation);

4.  allow  multiple  instances  of  each  task  to  be 
simulated,  the  number  of  such  instances  also 
being  determined  at  run  time; 

5. develop a communication system which can

a. determine which tasks communicate with
each other and with what kind of data (at
present we allow such dynamic configuration
to occur only at run time, but we plan to
support dynamic reconfiguration during the
simulation in the near future),

b. react to the inclusion of additional task
instances as well as the non-inclusion of
other tasks by developing an appropriate
communication graph (again supported only
at the beginning of the run at present),

c. keep messages in proper simulation time and
real time sequence, deliver them at the
correct simulation time, and keep the system
from deadlocking;

6.  allow  the  simulation  to  be  controlled  by  the  user 
from  a  single  location,  despite  its  multifaceted 
character. 

In this paper we describe the methods by which we have
implemented such a simulation framework and then discuss,
as an example, a large space defense simulation,
"Simulation 88", which makes use of at least five different
levels of parallelism available in the JPL/Caltech Mark IIIfp
hypercube. We believe that Simulation 88 is one of the
most sophisticated applications run on a hypercube to date.


II. THE GENERAL HYPERCUBE
SIMULATION FRAMEWORK

A. Mixed Task and Data Parallelism Using the
Centaur Operating System

Goals 1-4 are achieved in the following manner. Each
task is decomposed onto a SUBCUBE of the hypercube
(task parallelism). As far as possible, the number of nodes
in each subcube is kept approximately proportional to the
task workload per computation cycle divided by the desired
simulation time per cycle (the throughput of the subcube).
(Of course, the number of nodes in each subcube must be a
power of 2.) Each subcube has a designated master node, the
CORNER NODE, which communicates with corner nodes
of other subcubes.

Within each task the computation is generally
homogeneous. Therefore, algorithm speed-up is accomplished
using data-parallel algorithms, i.e., the traditional
homogeneous algorithms often proposed for hypercubes [1].

All communications, whether within or between
subcubes, are handled by the CENTAUR OPERATING
SYSTEM [2]. Within a subcube, the programmer uses
fast  synchronous  communication  subroutine  calls  (those  from 
the  so-called  "crystalline  operating  system"  or  CrOS  portion 
of  Centaur).  Between  subcubes,  specifically  between  corner 
nodes,  and  in  communications  with  the  outside  world,  the 
programmer  uses  asynchronous  communication  subroutine 
calls  (those  from  the  "Mercury"  portion  of  Centaur).  In 
Figure  1  we  show  a  generic  example  of  a  32-node  hypercube 
decomposed  into  eight  (8)  separate  subcubes,  each  of  which  is 
an  instance  of  one  of  five  distinct  tasks. 


Our scheme for decomposing the hypercube into subcube
tasks is described below. Consider the following input
parameters:

$D$ = dimension of the full hypercube

$\Delta T_i$ = real time for task $i$ to run one cycle on one
hypercube node

$\Delta t_i$ = desired simulation time for one cycle of task $i$

$n_i$ = number of instances desired for task $i$

To solve the decomposition problem, one must solve for the
dimension of each task's subcube $d_i$, subject to the following
constraints:

Each task's throughput must be load balanced as well as
possible: $TP_1 \approx TP_2 \approx TP_3 \approx \ldots$, where
$TP_i = 2^{d_i}\,\Delta t_i / \Delta T_i$

Each task has at least one node: $d_i \ge 0$

Tasks must fill the hypercube: $\sum_i n_i\,2^{d_i} = 2^D$


These constraints are satisfied by the following algorithm:

1. Initialize all $d_i = 0$

2. Compute the throughput $TP_i$ of each subcube $i$

3. Choose the subcube $j$ with the lowest throughput
and compute the result of attempting to double the
number of nodes for subcube $j$:
$N_{test} = \sum_i n_i\,2^{d_i} + n_j\,2^{d_j}$

4. If $N_{test} \le 2^D$, replace $d_j \leftarrow d_j + 1$; else freeze
$d_j$, but continue searching for the lowest throughput
among the other subcube tasks ($i \ne j$)

5. If all $d_i$ have been frozen, exit the above loop,
advancing to step 6; else go to step 2

6. If the final node total $\sum_i n_i\,2^{d_i}$ is less than $2^D$,
then there are spare nodes left, but there is no task
which needs them or which can be doubled to fill them.
Instead, fill them with a null task.
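
The steps above can be sketched in a few lines of Python.
This is an illustrative reading of the algorithm rather
than the authors' code: the function name decompose, its
argument names, and the up-front check that one node per
task instance fits in the hypercube are assumptions added
for the example.

```python
def decompose(D, real_time, sim_time, instances):
    """Greedy subcube sizing for a hypercube of dimension D.

    real_time[i] : real time for task i to run one cycle on one node
    sim_time[i]  : desired simulation time for one cycle of task i
    instances[i] : number of instances of task i
    Returns (d, spare), where d[i] is the subcube dimension chosen for
    task i and spare is the number of nodes left for a null task.
    """
    n_tasks = len(instances)
    d = [0] * n_tasks              # step 1: every subcube starts as one node
    frozen = [False] * n_tasks
    capacity = 2 ** D

    def total_nodes():
        return sum(n * 2 ** di for n, di in zip(instances, d))

    # Assumption: the text does not say what happens if even one node per
    # instance does not fit, so this sketch simply rejects that input.
    if total_nodes() > capacity:
        raise ValueError("not enough nodes for one node per task instance")

    while not all(frozen):         # step 5: stop once every task is frozen
        # Step 2: throughput of each subcube.
        throughput = [2 ** d[i] * sim_time[i] / real_time[i]
                      for i in range(n_tasks)]
        # Step 3: lowest-throughput task that is not frozen.
        j = min((i for i in range(n_tasks) if not frozen[i]),
                key=lambda i: throughput[i])
        n_test = total_nodes() + instances[j] * 2 ** d[j]
        # Step 4: double the subcube if it fits, otherwise freeze it.
        if n_test <= capacity:
            d[j] += 1
        else:
            frozen[j] = True

    # Step 6: whatever is left over goes to a null task.
    return d, capacity - total_nodes()
```

For example, on a 32-node hypercube (D = 5) with three
tasks whose single-node cycle times are 8, 2, and 1, equal
simulation time steps, and 1, 2, and 1 instances, this
sketch returns subcube dimensions [4, 2, 3] with no spare
nodes.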


B.  Constructing  the  Communications  Graph 
(Between  Corner  Nodes) 

Once the hypercube has been partitioned into subcubes,
the set of communication links among the subcubes (the
communications graph) must be specified. This graph, and
the communications calls made during the simulation, are the
key elements of our simulation framework. They ensure that
the correct data are passed between tasks at the proper
simulation times so that the tasks can continue to perform
their computations without deadlock.

One important requirement of the communication system
is that it must be able to build a graph given the number of
tasks and their instances available in the hypercube at run
time. It would be very cumbersome for the user if he were
required to manually reconfigure the communication links
every time he added or deleted a single task instance, or
changed a task's throughput, thereby changing the number of
nodes devoted to it. We have therefore implemented a general
scheme where the user specifies GENERAL
COMMUNICATION LINKS, which are valid under a
wide variety of circumstances. These general links are then
used by the framework to construct SPECIFIC
COMMUNICATION LINKS at run time. The user
need not know the number, size, or position in the hypercube
of the subcube tasks in order to use this general scheme.

For each general link the user must specify:

The type of data being sent (a master list of allowed
message types must be defined and be made part of the
framework);

The sending task type (but not the instance nor the node
number);

The receiving task type;

A COMMUNICATION RULE specifying to which
of the several possible instances of sending tasks the
receiving task should LISTEN for this message type.

Note that this "simulation mapping" process requires
algorithms  specific  to  the  simulation  being  performed.  It 
must  be  modified  for  each  new  simulation  being  developed. 

The  specification  of  "to  whom  to  listen",  rather  than  "to 
whom  to  send",  is  important.  One  can  be  derived  from  the 
other,  but  it  is  much  easier  to  construct  the  latter  from  the 
former and ensure that all tasks receive the data they need. In
addition, by avoiding multiple sources of data for each type,
this method ensures that most (if not all) messages sent will
be  picked  up  and  used  by  the  receiving  subcube.  (This  is 
useful  as  hypercube  nodes  have  a  finite  amount  of  memory 
and  cannot  afford  to  leave  a  large  number  of  unread 
messages.)  There  is  also  no  need  to  create  data  arbitration 
algorithms  for  each  communication  reception  to  handle  the 
case  when  more  than  one  message  of  a  given  type  arrives. 
However,  this  feature  limits  our  framework  to  those 
simulations  where  tasks  have  only  one  source  for  any  given 
data  type.  (Of  course,  additional  data  types  can  be  defined  to 
maintain  the  flexibility  needed  in  most  situations.) 

At run time the general links are used to construct the
specific communication links. A specific link is defined by
1) the sending subcube's corner node number and task type,
2) the receiving subcube's corner node number and task type,
and 3) the message type. All links involving a single corner
node are stored on that node in a lookup table. When a
SEND of a certain message type is executed by a task, the
intended receiving subcubes' type must also be specified in
the communication call. (Only one call is needed to send to
all subcubes of the same task type, but multiple sends must
be performed if the same message type is intended for more
than one task type.) The framework software then looks in
the table for all links with the proper receiving task type and
delivers the message to them. If no links satisfy the criteria,
the call is ignored, but an error code is returned. Likewise,
when a RECEIVE of a message type is executed, the software
first checks the lookup table to see if the link has been
defined.
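
The expansion of general links into per-node lookup tables
can be sketched as below. This is a Python illustration
only: the tuple layouts, the NO_LINK error value, the
function names, and the example listen rule same_instance
are assumptions, not part of the framework described here.

```python
from collections import namedtuple

GeneralLink = namedtuple("GeneralLink", "msg_type sender_type receiver_type rule")
Subcube = namedtuple("Subcube", "task_type instance corner_node")
NO_LINK = -1  # hypothetical error code for a SEND with no matching link


def same_instance(receiver, senders):
    """Example listen rule: receiver instance k listens to sender instance k (wrapping)."""
    return senders[receiver.instance % len(senders)]


def build_tables(general_links, subcubes):
    """Expand general links into per-corner-node tables of specific links.

    Each entry is (sender_node, sender_type, receiver_node, receiver_type, msg_type).
    """
    tables = {s.corner_node: [] for s in subcubes}
    for link in general_links:
        senders = sorted((s for s in subcubes if s.task_type == link.sender_type),
                         key=lambda s: s.instance)
        receivers = [s for s in subcubes if s.task_type == link.receiver_type]
        if not senders:
            continue  # sending task not included in this run: no link is built
        for r in receivers:
            s = link.rule(r, senders)  # the single instance this receiver listens to
            entry = (s.corner_node, s.task_type, r.corner_node, r.task_type, link.msg_type)
            tables[s.corner_node].append(entry)
            if r.corner_node != s.corner_node:
                tables[r.corner_node].append(entry)
    return tables


def send(tables, node, msg_type, receiver_type):
    """Return the corner nodes a SEND would deliver to, or NO_LINK if there are none."""
    dests = [e[2] for e in tables[node]
             if e[0] == node and e[3] == receiver_type and e[4] == msg_type]
    return dests if dests else NO_LINK
```

Because each receiver picks exactly one sender per message
type, every receiver has a single source for that type,
which is the property the listen-based specification is
meant to guarantee.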

The  above  scheme  avoids  deadlock  in  two  cases:  when  a 
specific  link  is  defined,  but  SEND  and/or  RECEIVE  are  not 
called;  and  when  a  link  is  not  defined,  and  SEND  or 
RECEIVE  is  called.  Nevertheless,  the  tasks  must  be  coded 
carefully  as  problems  can  still  occur.  Deadlock  will  occur  if 
a  link  is  defined  and  a  RECEIVE  is  called  by  one  task,  but 
the sending task specified by the link has not called a SEND.
Data overflow can occur if a link is defined and a SEND is
called by one task, but the corresponding RECEIVE is not.
The message queue on the receiving task then grows linearly
with time.

C. The Synchronization of Tasks Using Message
Passing, The Control of Simulation Time, And
The Assurance of Task Parallelism

A  message-passing  system  works  properly  only  if  the 
messages  are  sent  and  received  at  the  proper  times.  Therefore, 
message  passing  cannot  be  considered  without  also 
considering  the  flow  of  time  in  the  simulation  and  the 
method  by  which  the  tasks  are  synchronized.  In  our 
framework,  synchronization  is  controlled  by  the  message 
passing,  just  as  it  is  in  homogeneous  applications,  by 
forcing  the  receiving  task  to  wait  until  it  has  received  a 
message  which  satisfies  certain  criteria. 

Each task has its own internal clock which advances
simulation time in fixed increments of $\Delta\tau_i$. (Simulation
time increments of different task types do not have to be the
same.) Furthermore, in addition to the normal message
header information which Centaur places on the message, our
framework also time tags each message with the simulation
time at which it is sent. A message sent by task j at
simulation time $\tau_j$ and received by task i at time $\tau_i$ is
accepted only if its time tag $\tau_j$ is in the interval

$$\tau_i - \alpha\,\Delta\tau_j < \tau_j < \tau_i + (1-\alpha)\,\Delta\tau_j .$$

($\alpha$ is a parameter which describes the type of message
acceptance: $\alpha = 1$ denotes backward-biased, $\alpha = 0$ denotes
forward-biased, and $\alpha = 1/2$ denotes time-centered acceptance.)
Messages which are accepted are read from the queue but not
discarded; only messages with tags $\tau_j < \tau_i - \alpha\,\Delta\tau_j$ are
deleted. Note that the acceptance time interval is determined
by the sender's simulation clock "tick" ($\Delta\tau_j$) and not the
receiver's. This avoids deadlock regardless of whether the
ratio $\Delta\tau_i / \Delta\tau_j$ is less than or greater than unity. If a
message of the correct type, sending node, and time tag is not
in the queue, the receiving task waits until one arrives. This
synchronizes the tasks.
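
The acceptance rule can be written as two small predicates. This is a sketch under stated assumptions: the function names are hypothetical, and both bounds of the window are taken as strict inequalities, as printed in the text.

```python
def accepts(tau_i, tau_j, dtau_j, alpha):
    """True if a message tagged tau_j lies in the receiver's window at time tau_i.

    dtau_j is the sender's clock tick; alpha selects the bias of the window
    (1 = backward-biased, 0 = forward-biased, 0.5 = time-centered).
    """
    return tau_i - alpha * dtau_j < tau_j < tau_i + (1.0 - alpha) * dtau_j


def is_stale(tau_i, tau_j, dtau_j, alpha):
    """True if the message is older than the window and may be deleted."""
    return tau_j < tau_i - alpha * dtau_j
```

For example, with a sender tick of 1.0 and a receiver at time 10.0, a message tagged 9.5 is accepted under backward bias, while a message tagged 10.5 is accepted only under forward bias.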

It  is  possible  with  such  a  synchronization  scheme  to  force 
the  tasks  to  execute  in  a  sequential  fashion  and  not  in 
parallel!  That  is,  it  is  possible  that  only  one  task  at  a  time  is 
performing  any  computations  and  that  all  the  other  tasks  are 
waiting,  especially  if  the  communication  graph  has  one  or 
more closed loops embedded in it. This serial processing can
be avoided if each task executes its operations in each cycle in
a particular order:

1. Send all messages; if there is no data to send, still
   send a null message (header);

2. Receive all messages from other tasks;

3. Perform CPU-intensive work;

4. Advance simulation time by $\Delta\tau_i$ seconds.

Sending  all  messages  first  "primes  the  pump"  and  allows 
other  tasks  to  continue  executing  in  parallel,  especially  when 
closed  loops  exist  in  the  graph.  Advancing  the  simulation 
time  after  the  computation  emulates  the  passing  of 
simulation  time  during  the  computation  portion  of  the  cycle. 
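
The effect of sending first on a closed loop can be seen in a toy model. The sketch below is not the framework's code: it assumes a directed ring in which task t sends one message per cycle to task t+1, a blocking receive, and a round-robin scheduler; `simulate_ring` is a hypothetical name. In this simplified model a receive-first order stalls every task at once, which is the extreme case of the waiting described above.

```python
from collections import deque


def simulate_ring(n_tasks, n_cycles, send_first=True):
    """Run n_cycles of send/receive on a directed ring of tasks.

    Returns "completed" if every task finishes, or "deadlock" if a full
    scheduler pass goes by in which no task can make progress.
    """
    inbox = [deque() for _ in range(n_tasks)]
    order = ("send", "recv") if send_first else ("recv", "send")
    step = [0] * n_tasks     # position of each task inside its cycle
    cycle = [0] * n_tasks    # completed cycles per task
    clock = [0.0] * n_tasks  # simulation time per task
    dtau = 1.0
    while min(cycle) < n_cycles:
        progressed = False
        for t in range(n_tasks):
            if cycle[t] >= n_cycles:
                continue
            if order[step[t]] == "send":
                inbox[(t + 1) % n_tasks].append((t, clock[t]))
            elif not inbox[t]:
                continue  # blocking receive: nothing has arrived yet
            else:
                inbox[t].popleft()
            progressed = True
            step[t] += 1
            if step[t] == len(order):
                step[t] = 0
                clock[t] += dtau  # advance simulation time after the cycle's work
                cycle[t] += 1
        if not progressed:
            return "deadlock"
    return "completed"
```

Calling `simulate_ring(4, 3, send_first=True)` runs to completion, whereas `simulate_ring(4, 3, send_first=False)` stalls on the first pass because every task waits for a message that no one has sent yet.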

D.  Centralized  Simulation  Control 
(The  C3PO  System) 

Control  of  the  execution  of  the  simulation  is  provided  by 
a  program  running  in  the  Control  Processor  of  the  hypercube 
(see  Figure  1):  C3PO  (Command  and  Parameter  Processor 
for  Program  Organization).  After  the  hypercube  is 
partitioned,  and  before  the  communication  links  are  set  up, 
each  subcube  task  defines  a  set  of  parameters  and  commands 
which  control  that  task.  (Typical  parameters  are  names  of 
initialization  files,  printing  and  plotting  flags,  etc.;  typical 
commands  are  initialization,  starting  the  execution  of  the 
cycles  in  each  task,  and  commands  which  alter  the  simulation 
during  execution  such  as  shutdown.)  The  parameters  and 
command  names  and  types  are  sent  up  to  C3PO  in  the  CP 
where  they  are  stored  in  a  symbol  table.  All  tasks  then  listen 
to  C3PO  for  commands  and  continue  to  do  so  even  while 
executing  other  commands. 

A  one-word  command  issued  by  the  user  at  the  C3PO 
prompt  will  execute  a  subroutine  on  all  subcubes  which 
recognize  that  command.  In  addition  to  commands,  the 
C3PO interpreter also executes a C-like language. Parameter
values  may  be  altered  at  the  C3PO  level  with  assignment 
statements,  C3PO  functions,  etc.  Each  command  issued  will 
use  the  latest  values  of  the  parameters. 

For  sophisticated  simulations  the  C3PO  program  itself 
can  be  a  task  with  its  own  set  of  parameters  and  commands. 
This  is  most  useful  during  the  pre-initialization  phase  of  a 
simulation  when  the  partitioning  of  the  hypercube  into  tasks 
is determined.

III. APPLICATION TO A COMPLEX
STRATEGIC DEFENSE SIMULATION:
SIMULATION 88

We  have  used  the  above  framework  to  construct  a  detailed 
simulation  of  a  strategic  defense  system.  This  simulation, 
called  Simulation  88,  is  an  emulation  of  a  portion  of  a 
constellation  of  missile  sensors,  trackers,  battle  managers, 
and  weapons  platforms.  Simulation  88  is  composed  of  the 
following  major  tasks,  each  of  which  is  a  separate  C 
program: SWIR (short-wave infrared) sensor; tracker of
SWIR sensor data capable of stereo processing; LWIR (long-wave
infrared) sensor; tracker of LWIR sensor data capable of
stereo processing; a global engagement manager which
allocates weapons in the arsenal based on ability to engage
and the probability of kill; a fire control module which
schedules weapon release and performs guidance; an
environment generator which launches the threat, flies the
SDI platforms, and generally takes care of functions
performed  by  the  enemy  or  by  nature;  and  a  simulation 
monitor  which  doubles  as  the  null  task  when  not  running  on 
node  0  of  the  hypercube.  In  addition  to  being  able  to 
communicate  with  one  another  to  assess  and  respond  to  the 
threat,  most  tasks  can  also  open  one  or  more  windows  on 
external  color  graphics  workstations  for  viewing  the 
simulation  progress.  The  amount  of  C  code  running  on  these 
workstations  is  nearly  equal  to  that  running  in  the  hypercube. 

Simulation  88  has  been  designed  and  implemented  in  an 
unclassified  environment.  However,  it  is  parameterized 
(through  the  use  of  C3PO  parameters  and  initialization  files) 
so  that  it  can  be  run  in  a  classified  manner  in  the  proper 
environment.


A  run  of  Simulation  88  is  uniquely  determined  by  a 
configuration  file  which  defines  1)  which  tasks  are  active,  2) 
how  tasks  are  bundled  together  to  form  SDI  platforms,  3)  the 
total  number  of  platforms  of  each  type  and  their  orbits,  4) 
how  tasks  communicate  (the  general  links),  5)  how  many 
platforms  of  each  type  we  wish  to  emulate  in  the  hypercube 
(the  rest  are  simulated  in  lower  fidelity),  and  6)  how  large  a 
hypercube  we  wish  to  use  for  the  simulation.  All  this 
information  is  parsed  by  the  C3PO  program  before  the 
hypercube  is  booted.  After  all  executable  code  is  downloaded 
into the hypercube, the specific platforms to be emulated are
chosen  from  the  constellations  according  to  which  ones  can 
fight  the  battle  best.  All  communication  links  between 
platforms  are  constructed,  but  only  that  subset  which 
involves  the  chosen  platforms  is  used  for  actual  Centaur 
communications  between  tasks.  Figure  2  shows  the  node 
allocation  which  results  from  a  typical  64-node  run. 
Increasing  the  cube  dimension  to  seven  (7),  for  example, 
would  not  change  the  number  of  modules  but  would  change 
the  numbers  of  nodes  allocated  to  each. 


Simulation  88  makes  use  of  at  least  five  different  levels 
of  parallelism: 

Multi-machine  parallelism:  graphics  processing  and 
display  occur  in  parallel  with  the  hypercube  simulation 
computations; 

Task  parallelism  within  the  hypercube:  the  simulation 
is  subdivided  into  subcubes;  C3PO  in  the  CP  is  also  a 
task; 

Data parallelism within each subcube of dimension $d_i > 0$:
each task occupies $2^{d_i}$ nodes;

Intra-node parallelism: each task's code runs in the
68020/68882 processor or in the Weitek floating point
processor; the Centaur communications are performed in
parallel by a separate 68020 on each hypercube node;

Pipelined parallelism: some tasks execute their code in
the Weitek floating point processor of the Mark IIIfp
hypercube; this processor accomplishes parallelism on a
machine instruction level.

ABSTRACT

In  this  paper,  we  discuss  the  implementation  of  Bitz  and 
Kung's  path  planning  algorithm  on  a  ring  of  general- 
purpose  processors.  We  show  that  Bitz  and  Kung's 
algorithm,  originally  designed  for  the  Warp  machine,  is 
not efficient in this context, due to the intensive inter-processor
communications that it requires. We design a
modified  version  that  performs  much  better.  The  new 
version  updates  a  segment  of  k  positions  within  a  step  and 
allocates  blocks  of  r  consecutive  rows  of  the  map  to  the 
processors  in  a  wraparound  fashion.  Bitz  and  Kung’s 
algorithm  corresponds  to  the  situation  (k,r)  =  (1,1).  We 
analytically  determine  the  optimal  values  of  the 
parameters  (k,r)  which  minimize  the  parallel  execution 
time  as  a  function  of  the  problem  size  n  and  of  the 
number  of  processors  p.  The  theoretical  results  are  nicely 
corroborated  by  numerical  experiments  on  a  ring  of  32 
Transputers. 


1.  INTRODUCTION 

Given  a  map  on  which  each  position  is  associated  with  a 
traversability  cost,  the  path  planning  problem  is  to  find  a 
minimum-cost  path  from  a  source  position  to  every  other 
position in the map: look at the artificial example of
figure  1.  The  altitudes  of  the  points  of  this  surface  are 
proportional  to  their  traversability  costs.  The  top  of  the 
spiral-shaped  hill  has  a  high  traversability  cost,  while  the 
bottom  of  the  valley  is  easier  to  go  through.  The  source 
lies  in  the  center  of  the  spiral.  We  plot  here  a  shortest 
path  from  the  border  of  the  domain  to  the  source.  We 
clearly  see  that  the  path  hesitates  between  walking  in  the 
valley  (long  but  easy),  and  crossing  the  hill  (shorter  in 
distance,  but  more  costly). 

Bitz  and  Kung  [BK]  have  recently  proposed  a  dynamic 
programming  algorithm  to  solve  the  problem,  and  they 
have  mapped  this  algorithm  onto  the  linear  systolic  array 
in the Warp machine [AAG]. We show that Bitz and
Kung's algorithm is not efficient in the context of
general-purpose processors, due to the intensive communication
scheme that it requires.


Figure  1:  A  shortest  path  on  an  artificial  map. 

2.  PATH  PLANNING  ALGORITHM 

A map M is an n × n grid of positions, for some positive
integer n. The eight neighbors of a position p are indicated
by the corresponding cardinal point in the compass (see
figure 2).

NW  N  NE
W   p  E
SW  S  SE

Figure 2: Labeling the eight neighbors of a position p

Each position p is associated with a non-negative real
number tc(p) corresponding to the traversability cost of
the position. Given a position p and a neighbor q of p, the
edge (p,q) is weighted with a cost $c(p,q) = (tc(p)+tc(q))/2$ if
$q \in \{N, S, W, E\}$ and $c(p,q) = (tc(p)+tc(q))\sqrt{2}/2$ otherwise:
the $\sqrt{2}$ multiplier reflects the added traveling distance due
to the diagonal connection. Given a position, called the
source, we want to compute the shortest path (or
minimum-cost path) from it to every position in the map.
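
The edge weighting can be sketched as a small function. The representation is an assumption made for the example: the map is a list of rows of traversability costs, positions are (row, column) pairs, and `edge_cost` is a hypothetical name.

```python
import math


def edge_cost(tc, p, q):
    """Cost of the edge between neighboring positions p and q.

    tc is a square grid (list of rows) of traversability costs.
    Diagonal edges carry the extra sqrt(2) factor.
    """
    (pi, pj), (qi, qj) = p, q
    di, dj = abs(pi - qi), abs(pj - qj)
    if max(di, dj) != 1:
        raise ValueError("q is not one of the eight neighbors of p")
    base = (tc[pi][pj] + tc[qi][qj]) / 2.0
    return base * math.sqrt(2.0) if di == 1 and dj == 1 else base
```

On the 2 × 2 map `[[1, 3], [5, 7]]`, the horizontal edge from (0, 0) to (0, 1) costs 2.0, and the diagonal edge from (0, 0) to (1, 1) costs 4√2.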

Bitz  and  Kung  [BK]  propose  a  dynamic  programming 
algorithm  to  solve  the  problem.  Initially,  the  best  known 
cost  f(p)  for  every  position  p  in  the  map  is  assigned  the 
value 0 at the source and ∞ at all other positions. The
algorithm  performs  a  succession  of  red  and  blue  sweeps  of 
the  map. 

2.1.  RED  SWEEP 

The red sweep is a forward scan of the map M in the row-major
ordering. During the red sweep, each position p is
updated according to the red mask depicted in figure 3.

Figure 3: Red sweep and the associated mask

For the current position p of the sweeping, we update the
best known cost f(p) if there exists a better path passing
by one of the red neighbors of p. For instance if the best
known cost f(W) of the west neighbor of p plus the cost
c(W,p) of the edge from W to p is smaller than f(p), we
update f(p) into f(p) := f(W) + c(W,p). In the general case,
the update of f(p) is defined as

$$f(p) := \min\bigl(f(p),\; f(W) + c(W,p),\; f(NW) + c(NW,p),\; f(N) + c(N,p),\; f(NE) + c(NE,p)\bigr)$$

that is

$$
\begin{aligned}
\text{(RS)}\quad f(p) := \min\bigl(\, & f(p),\\
& f(W) + (tc(p)+tc(W))/2,\\
& f(NW) + (tc(p)+tc(NW))\sqrt{2}/2,\\
& f(N) + (tc(p)+tc(N))/2,\\
& f(NE) + (tc(p)+tc(NE))\sqrt{2}/2 \,\bigr)
\end{aligned}
$$

2.2.  BLUE  SWEEP 

The blue sweep scans the map M in the reversed row-major
ordering as shown in figure 4. For the current
position p of the sweeping, the update of f(p) is defined
similarly as for the red sweep, but using the blue
neighbors instead of the red ones:

Figure 4: Blue sweep and the associated mask

$$f(p) := \min\bigl(f(p),\; f(E) + c(E,p),\; f(SE) + c(SE,p),\; f(S) + c(S,p),\; f(SW) + c(SW,p)\bigr)$$

that is

$$
\begin{aligned}
\text{(BS)}\quad f(p) := \min\bigl(\, & f(p),\\
& f(E) + (tc(p)+tc(E))/2,\\
& f(SE) + (tc(p)+tc(SE))\sqrt{2}/2,\\
& f(S) + (tc(p)+tc(S))/2,\\
& f(SW) + (tc(p)+tc(SW))\sqrt{2}/2 \,\bigr)
\end{aligned}
$$


2.3. PATH PLANNING ALGORITHM

Given the initial values stated above, the red and blue
sweeps are performed alternately until no values are
changed in one sweep. Let us color the edges of a path
according to their directions: edges pointing to W, NW, N
and NE directions are colored blue, whereas edges pointing
to E, SE, S and SW directions are colored red. Then Bitz
and Kung [BK] show that the number of required sweeps
before all positions receive their final values is C or C+1,
where C is the maximum number of color changes in a
shortest path from the source to any other position. Hence
in the worst case, the number of required sweeps can be as
large as $O(n^2)$. However in practical situations, it is
expected to be much smaller than n [BK].
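
A sequential version of the alternating sweeps can be sketched as below. It is an illustration rather than Bitz and Kung's code: the names `sweep` and `plan` are hypothetical, the map is a list of rows, and the loop stops only after two consecutive sweeps (one of each color) leave all values unchanged. That stopping rule is a conservative assumption of this sketch, made so that a first red sweep that happens to change nothing does not end the computation early.

```python
import math

RED = [(0, -1), (-1, -1), (-1, 0), (-1, 1)]  # W, NW, N, NE
BLUE = [(0, 1), (1, 1), (1, 0), (1, -1)]     # E, SE, S, SW


def sweep(tc, f, mask, reverse):
    """One in-place sweep over the map; returns True if any value changed."""
    n = len(tc)
    idx = range(n - 1, -1, -1) if reverse else range(n)
    changed = False
    for i in idx:
        for j in idx:
            best = f[i][j]
            for di, dj in mask:
                a, b = i + di, j + dj
                if 0 <= a < n and 0 <= b < n:
                    w = (tc[i][j] + tc[a][b]) / 2.0
                    if di != 0 and dj != 0:
                        w *= math.sqrt(2.0)  # diagonal edge
                    best = min(best, f[a][b] + w)
            if best < f[i][j]:
                f[i][j] = best
                changed = True
    return changed


def plan(tc, source):
    """Alternate red and blue sweeps; returns (cost grid, number of sweeps)."""
    n = len(tc)
    f = [[math.inf] * n for _ in range(n)]
    f[source[0]][source[1]] = 0.0
    sweeps, idle, red = 0, 0, True
    while idle < 2:
        changed = sweep(tc, f, RED if red else BLUE, reverse=not red)
        sweeps += 1
        idle = 0 if changed else idle + 1
        red = not red
    return f, sweeps
```

On a uniform 3 × 3 map of cost 1 with the source in the bottom-right corner, the first red sweep changes nothing and a single blue sweep then fills in every value, for example 2√2 at the opposite corner.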


In  the  following,  we  concentrate  upon  the  parallel 
implementation  of  a  single  sweep  (a  red  sweep)  on  a  ring 
of  processors. 

3.  PARALLEL  IMPLEMENTATION 


We  briefly  recall  Bitz  and  Kung's  solution  for  mapping 
the  path  planning  algorithm  onto  the  Warp.  Such  a 
solution  is  not  suited  to  a  ring  of  general-purpose 
processors, and we derive a modified version that performs
much better.


3.1.  BITZ  AND  KUNG’S  MAPPING  METHOD 
We consider a ring of processors numbered from 0 to p-1.
Each row of the map is assigned to a processor. Assume
first that the number of processors p is equal to the
problem size n. In this case processor i gets row i, 0 ≤ i < n.
For the red sweep, immediately after processor i has
computed the value of two positions, it will pass these
values to processor i+1 to get it started. We summarize in
figure 5 the time-steps at which each position is updated.

Figure 5: Time-steps for Bitz and Kung's parallel
algorithm

At time 2i+j, processor $P_i$ operates as follows (wherever
indices make sense):

• it receives position (i-1, j+1) from $P_{i-1}$

• it updates position (i, j)

• it sends position (i, j) to $P_{i+1}$

When p is smaller than n, partitioning techniques must be
considered. Assume for the sake of simplicity that p
divides n. Bitz and Kung propose to assign the rows of the
map to the processors in a wraparound fashion: processor i
gets rows j such that j ≡ i (mod p). The wrap mapping is a
widely used technique for balancing the workload among
the processors [GH, MR, MV, Saa]. Now $P_0$ needs to
receive computed values from $P_{p-1}$. Note that $P_0$ receives
the first value (p-1,0) from $P_{p-1}$ at time 2p-1. At time 2p,
$P_0$ receives the second value (p-1,1) and updates position
(p,0). Hence we do not want $P_0$ to finish the updating of
row 0 before time 2p, otherwise it would stay idle for a
while. This implies that n ≥ 2p. If n > 2p, $P_0$ will simply
store the values it receives from $P_{p-1}$ until it starts the
updating of its second row.
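
The schedule "position (i, j) is updated at time 2i+j" can be checked against the red mask with a few lines of code. This is a sketch for the case p = n; the function names are hypothetical.

```python
RED_MASK = [(0, -1), (-1, -1), (-1, 0), (-1, 1)]  # W, NW, N, NE


def update_time(i, j):
    """Time-step at which position (i, j) is updated when p == n."""
    return 2 * i + j


def schedule_is_valid(n):
    """True if every red neighbor is updated strictly before its dependent."""
    for i in range(n):
        for j in range(n):
            for di, dj in RED_MASK:
                a, b = i + di, j + dj
                if 0 <= a < n and 0 <= b < n:
                    if update_time(a, b) >= update_time(i, j):
                        return False
    return True


def wrap_owner(row, p):
    """Processor holding a given row under the wrap mapping."""
    return row % p
```

The tightest dependency is the NE neighbor (i-1, j+1), which is updated at time 2i+j-1, one step before it is needed; this is why each processor starts two steps after its predecessor.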


We see that the latency between the startup times of two
adjacent processors is small (two time-steps). The major
drawback of the algorithm is that it involves many short
communications between the processors. For current
distributed memory machines, the time to transfer L words
between two adjacent processors can be modeled by
$\beta + L\,\tau_c$, and it turns out that $\beta$ is significantly higher than $\tau_c$
([GH, MV, Saa], see also the experiments reported in
section 5). This renders the cost of small messages
prohibitive.

We explain below how to modify Bitz and Kung's
algorithm in order to decrease the communication
overhead. We describe the new algorithm informally, and
postpone its complexity analysis to the next section.


3.2. UPDATING A SEGMENT OF LENGTH K
The first way to decrease the communication overhead is
to use longer messages. We use the same mapping
strategy as before, but we update a segment of k
consecutive positions at each step. The algorithm is
illustrated in figure 6. Note that k does not need to be a
divisor of n. In figure 6, we let $l_0$ be the number of
positions updated by $P_0$ at time 0; we choose $l_0 = k-1$ in
our implementation, just as if $P_0$ had received k fictitious
values before beginning (but $l_0$ can be any number
between 1 and k-1). Each processor always updates k
positions, except possibly for the first and last updates: we
start the update of the next row while finishing the update
of the current row (see figure 6). The condition for $P_0$ not
to finish its first row before receiving data from $P_{p-1}$ will
be derived in the next section: we obtain the condition
$n \ge (k+1)\,p$.

[Figure 6 diagram: time-step numbers for processors $P_0$ to $P_3$; panels "First row of each processor" and "Second row"; segment lengths $l_0$ and k marked]

Figure 6: Updating a segment of length k

The number of data items communicated between two
neighbor processors is exactly the same as before, but the
larger k, the more efficiently the communications are
performed. On the other hand, the larger k, the greater the
latency between the startup times of two adjacent
processors. We must be ready to find a compromise
between the two contradictory requirements of minimum
startup delay (small k) and inexpensive communications
(large k).
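
The weight of the startup term in the linear cost model can be illustrated numerically. The sketch below uses the constants measured in section 5 (startup 2e-3 s, 12.5e-6 s per word); the function names are hypothetical, and the calculation covers communication time only.

```python
BETA = 2e-3      # startup time per message, in seconds (value from section 5)
TAU_C = 12.5e-6  # transfer time per word, in seconds (value from section 5)


def transfer_time(words, beta=BETA, tau_c=TAU_C):
    """Time to send one message of `words` words to a neighbor."""
    return beta + words * tau_c


def row_transfer_time(n, k, beta=BETA, tau_c=TAU_C):
    """Time to ship the n values of one row in messages of k words.

    The last message may be shorter when k does not divide n.
    """
    full, rest = divmod(n, k)
    total = full * transfer_time(k, beta, tau_c)
    if rest:
        total += transfer_time(rest, beta, tau_c)
    return total
```

For a row of n = 1920 values, one-word messages cost about 3.86 s of communication, while messages of 54 words cost about 0.096 s: the data volume is the same, but the startup cost is paid 36 times instead of 1920 times.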


3.3.  MOVING  TO  NEW  MAPPING  STRATEGIES 
Another way to decrease the communication overhead is to
communicate fewer data items between neighbor processors.
We now consider more general allocation functions than
the wrap mapping, and we assign blocks of r consecutive
rows to the processors in a wraparound fashion [RTV].
For instance with r = 3, n = 36 and p = 4, processor 0 gets
rows 0-2, 12-14 and 24-26, processor 1 gets rows 3-5, 15-17
and 27-29, and so on.
Such a repartition is illustrated in figure 7. Analytically,
processor i gets rows j such that $i = \lfloor j/r \rfloor \bmod p$,
$0 \le j \le n-1$.
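
The block-r allocation function is short enough to state directly in code. The names `block_owner` and `rows_of` are hypothetical; rows and processors are numbered from 0 as in the text.

```python
def block_owner(row, r, p):
    """Processor holding `row` when blocks of r rows are dealt out cyclically."""
    return (row // r) % p


def rows_of(proc, n, r, p):
    """All rows of an n-row map assigned to processor `proc`."""
    return [j for j in range(n) if block_owner(j, r, p) == proc]
```

With r = 1 this reduces to the wrap mapping of Bitz and Kung; with r = 3, n = 36 and p = 4, row 11 belongs to processor 3 and row 12 to processor 0, so the boundary between them is one of the places where values must cross the ring.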


Figure 7: Block-r mapping, n = 36, p = 4, r = 3

The time-steps are depicted in figure 8. At each step
except the first and last ones, all the processors update a
parallelogram of r*k positions. Just as before for r = 1, we
start the update of the next block while finishing the
update of the current block (see figure 8).

Figure 8: Parallel algorithm, n = 36, p = 4, r = 3 and k = 4

The condition that k and r must meet to keep all
processors activated is the following: $n \ge p\,(r+k)$ (see next
section). We show in figure 9 an example where this
condition is not met: we see that $P_0$ is idle at time 4,
because it has not received in time from $P_3$ the first
positions of row 11.

[Figure 9 diagram; panels: First block, Second block]

Figure 9: Parallel algorithm, n = 36, p = 4, r = 3 and k = 11

Now, the number of data items communicated between
two neighbor processors is r times smaller than in Bitz
and Kung's implementation, because the processors only
need to exchange information relative to the boundary
rows of each block. Segments belonging to an internal
row of a block do not require any inter-processor
communication. We illustrate the communications
between two neighbor processors in figure 10.

Figure 10: Communications between two processors

The price to pay for such a dramatic reduction of the
communication volume is again an increase in the latency
between the startup times of two adjacent processors.
Hence the best value of r will result from a compromise, just
as the best value of k.

In the next section, we perform a complexity analysis.
Given n and p, we analytically determine the values of k
and r that minimize the parallel execution time.


4.  PERFORMANCE  EVALUATION 

In this section, we analyse the performance of the parallel
algorithm described above. For the arithmetic, we let $\tau_a$ be
the elemental time needed for updating a position during
the sweep (formulae RS or BS). Since there are $n^2$
positions to update during a sweep, the sequential time for
a problem of size n is $T_{seq} = n^2\,\tau_a$.

4.1.  MEMORY  REQUIREMENT 
The space requirement for the sequential algorithm is
proportional to the size of the map, that is $n^2$ positions.
For each position, we need to store a word for its current
value and 8 words for the traversability cost of the 8
adjacent edges. Let us choose as a unit the memory
requirements for a position. Given a single processor with
a memory of size M, this implies that the maximal
problem size that can be dealt with is $n_{max,1} = \sqrt{M}$.
Consider now a ring of p processors. First of all, we have
to determine the relationship between p and n. We have p
memories of size M, so that we can solve in parallel a
problem of size at most $n_{max,p} = \sqrt{pM}$. Note that we
neglect here any additional storage required by the parallel
implementation, such as the need for communication
buffers. In fact, the value $n_{max,p}$ above is an upper
bound.

As stated before, we consider an allocation by blocks of
consecutive rows of size r in a wraparound fashion, where
$1 \le r \le n/p$. For the sake of simplicity (without loss of
generality), we assume that p*r divides n, so that each
processor holds the same number of rows in its local
memory.

4.2.  PARALLEL  EXECUTION  TIME 
Even though the implementation is asynchronous, we can
view the parallel algorithm as a succession of time-steps,
where at each time-step each processor updates r segments
of k positions. Within a time-step, processor $P_i$ receives a
message of length k from processor $P_{i-1}$, updates r*k
positions, and sends a message of length k to $P_{i+1}$
(indices are taken mod p). Note that the emission is
non-blocking, whereas the reception is blocking. $P_i$ does not wait for its
emission to be completed before moving to the next step.
As a consequence, the communication within a time-step
has a cost equal to $\beta + k\,\tau_c$. The total time needed to
perform a time-step is $\tau_{step} = \beta + k\,\tau_c + r k\,\tau_a$.

To evaluate the total number of time-steps in the
algorithm, we first compute the time-step at which a
processor $P_q$, $0 \le q \le p-1$, initiates its computation. Recall
that $P_0$ updates $l_0$ positions in its first row at time $t_0 = 0$.
We see that $P_1$ updates $l_1 = (l_0 - r) \bmod k$ positions in its
first row at time $t_1 = 1 + \lceil (r - l_0)/k \rceil$, and more generally,
that $P_q$ updates $l_q$ positions in its first row at time $t_q$,
where

$$l_q = (l_0 - q\,r) \bmod k, \qquad t_q = q + \lceil (q\,r - l_0)/k \rceil .$$

Now, we derive easily the total number of time-steps $T_p$,
since $P_{p-1}$ is the last processor to end its computation.
After updating its first parallelogram, $P_{p-1}$ has still

$$\lceil (n^2/(p\,r) - l_{p-1} + r - 1)/k \rceil$$

parallelograms to update, so that

$$T_p = t_{p-1} + \lceil (n^2/(p\,r) - l_{p-1} + r - 1)/k \rceil .$$

The parallel execution time of the algorithm is then

$$T_{//} = \tau_{step} \cdot T_p$$
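
The step count can be evaluated exactly, ceilings included. The sketch below follows the expressions of this section with hypothetical function names; it assumes that p*r divides n, that $l_0 = k-1$ as in the implementation described earlier, and it uses the timing constants measured in section 5.

```python
def ceil_div(a, b):
    """Ceiling of a / b for integers, valid for negative a as well."""
    return -(-a // b)


def first_len(q, r, k, l0):
    """l_q: positions updated by processor q in its first step."""
    return (l0 - q * r) % k


def start_time(q, r, k, l0):
    """t_q: time-step at which processor q starts computing."""
    return q + ceil_div(q * r - l0, k)


def total_steps(n, p, r, k, l0):
    """T_p: time-step count, governed by the last processor of the ring."""
    last = p - 1
    per_proc = n * n // (p * r)  # assumes p*r divides n
    remaining = ceil_div(per_proc - first_len(last, r, k, l0) + r - 1, k)
    return start_time(last, r, k, l0) + remaining


def parallel_time(n, p, r, k, l0, beta, tau_c, tau_a):
    """Parallel execution time: cost of one time-step times the step count."""
    return (beta + k * tau_c + r * k * tau_a) * total_steps(n, p, r, k, l0)
```

For n = 1920, p = 32, r = 6, k = 54 and $l_0$ = 53, this gives 390 time-steps and a parallel time of about 10.52 seconds, in line with the analytical figure quoted in section 5.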

This evaluation is valid only if the processors are not kept
idle, waiting for some data they need from their
predecessor. As explained in the previous section, this
condition is equivalent to ensuring that $P_0$ has not
finished the updating of its first block before receiving
from $P_{p-1}$ the data that it needs for its second block. $P_0$
performs its first reception at time $t_p$. At that time it has
already updated $l_0 + k\,(t_p - 1)$ positions in the first row of
its first block. The condition is that the sum of the
remaining positions in this row plus the number of
positions that it might update in the first row of the
second block is greater than or equal to k, so that it can
update a whole parallelogram at time $t_p$:

$$n - (l_0 + k\,(t_p - 1)) + l_p \ge k$$

After some algebra (using $k \lceil (p\,r - l_0)/k \rceil = p\,r - l_0 + l_p$) we get:

$$n \ge p\,(r+k)$$

We retrieve the condition illustrated in figures 8 and 9.

Neglecting low order terms and ceiling functions, we
obtain the following analytical evaluation for the parallel
execution time $T_{//}$:

Proposition: Given a problem of size n and a ring of p
processors, the parallel execution time $T_{//}$ for a block-r
allocation, $1 \le r \le n/p$, using segments of length k,
$1 \le k \le n/p - r$, is

$$T_{//} = (\beta + k\,\tau_c + r k\,\tau_a)\,\bigl[(p-1)(1 + r/k) + n^2/(p\,k\,r)\bigr]$$

Given n, p and r it is easy to find the value $k_{opt}(r)$ of k
that minimizes the execution time $T_{//}$. We obtain the
value

$$k_{opt}(r) = \min\bigl(k_{max}(r),\, k_{//}(r)\bigr)$$

where

$$k_{max}(r) = n/p - r$$

and

$k_{//}$ is the optimal value obtained from the
expression of $T_{//}$:

k//(r)  = 

Given n, p and numerical values for the parameters
$\beta$, $\tau_c$ and $\tau_a$, it remains to compute $k_{opt}(r)$ and to plug it into the
expression of $T_{//}$ to determine the best value of r. We
report numerical experiments in the next section.
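
The choice of k for a given r can be sketched numerically. The code below is an illustration under explicit assumptions: `t_parallel` evaluates the approximation $(\beta + k\tau_c + rk\tau_a)[(p-1)(1+r/k) + n^2/(pkr)]$ obtained by dropping the ceilings from the step count of section 4.2, and `k_stationary` is the point where the derivative of that expression with respect to k vanishes. This stationary point is derived here and is not claimed to be the paper's own closed form for $k_{//}(r)$. All function names are hypothetical; the constants are the values measured in section 5.

```python
import math

BETA, TAU_C, TAU_A = 2e-3, 12.5e-6, 75e-6  # seconds, from section 5


def t_parallel(n, p, r, k, beta=BETA, tau_c=TAU_C, tau_a=TAU_A):
    """Approximate parallel time, ceilings and low-order terms neglected."""
    step = beta + k * tau_c + r * k * tau_a
    steps = (p - 1) * (1 + r / k) + n * n / (p * k * r)
    return step * steps


def k_stationary(n, p, r, beta=BETA, tau_c=TAU_C, tau_a=TAU_A):
    """Unconstrained minimizer of t_parallel over k (assumption of this sketch).

    Writing T = (beta + B*k) * (C + D/k) with B = tau_c + r*tau_a,
    C = p - 1 and D = (p-1)*r + n^2/(p*r), dT/dk = 0 gives
    k = sqrt(beta * D / (B * C)).
    """
    d = (p - 1) * r + n * n / (p * r)
    return math.sqrt(beta * d / ((p - 1) * (tau_c + r * tau_a)))


def k_opt(n, p, r):
    """Best admissible k: the stationary point, capped at k_max = n/p - r."""
    return min(n / p - r, k_stationary(n, p, r))
```

For n = 1920 and p = 32, the cap $k_{max}(r)$ is active for r = 5 and the stationary point takes over at r = 6, where it lands near 52; the approximate time at (r, k) = (6, 54) evaluates to about 10.52 seconds. The small gap with the value k = 54 reported in section 5 may stem from the terms neglected here.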

5.  NUMERICAL  EXPERIMENTS 

In this section, we report on numerical experiments on a
ring of Inmos Transputers T414, using up to 32
processors. We use an FPS-T40 hypercube [GHS], which
we configure as a ring. First of all we have to determine
$\tau_a$ and $\tau_c$.

Each update in (RS) or (BS) amounts to four additions,
four comparisons, plus some conditional logic. We find
that $\tau_a$ = 75e-6 seconds. For the communications, we
obtain experimentally that the time to transfer L words
between two adjacent processors is $\beta + L\,\tau_c$, with $\beta$ = 2e-3
seconds and $\tau_c$ = 12.5e-6 seconds.

The first thing we check is that the parallel execution time
obeys our formulas. We fix n and p and let the segment
size k vary, with various values of the block size r. We
superimpose in figure 11 the experimental and theoretical
curves (with the previous values of $\beta$, $\tau_c$ and $\tau_a$) for the
parallel execution time. There is very good agreement
between the curves.

We find experimentally the optimal values of r and k: $r_{opt}$
= 6 and $k_{opt}$ = 54. For these values we obtain $T_{//}$ = 10.36
seconds. These values are in good accordance with the
theory: if we plug the values of n = 1920 and p = 32 in
the formulas of the previous section, we obtain

$$k_{opt}(r) = \begin{cases} k_{max}(r) & \text{for } r \le 5 \\ k_{//}(r) & \text{for } r \ge 6 \end{cases}$$

and $T_{//}$ is minimum for r = 6 and k = $k_{//}(r)$ = 54. We
obtain $T_{//}$ = 10.52 seconds with the analytical
expressions.

We point out that the execution time with ($r_{opt}$, $k_{opt}$) is
divided by a factor of 23.8 as compared to Bitz and Kung's
algorithm which corresponds to the values (r,k) = (1, 1)
and for which the execution time is as high as 247
seconds.

Figure 11: Parallel time as a function of k;
n = 1920, p = 32

In figure 12, we plot the speedups that we obtain with 32
processors when solving a problem of size n = 1920. Note
that these speedups are computed according to Gustafson's
recent proposal [Gus, CRT], in that they are normalized
by the amount of arithmetic operations which they require
(since it is impossible to solve such a large problem with
a single processor). Using 32 processors, we report
acceleration factors as high as 26.

[Figure 12: scaled speedup]

show is the function e(r, k) for the following values of r
and k: $1 \le r \le n/(2p)$, $1 \le k \le k_{max}(r)$. The optimal
efficiency e = 0.81 is obtained for the highest point of this
surface, with r = 6 and k = 54.

Figure 13: 3D-plot of the efficiency e(r,k),
n = 1920, p = 32


6. CONCLUSION

In this paper, we have discussed the implementation of
Bitz and Kung's path planning algorithm on a ring of
general-purpose processors. We have designed a modified
version that updates a segment of k positions within a
step and allocates blocks of r consecutive rows of the map
to the processors in a wraparound fashion. We have
analytically determined the optimal values of the
parameters (k,r) which minimize the parallel execution
time as a function of the number of processors p and of
the problem size n. The theoretical results are nicely
corroborated by numerical experiments on a ring of 32
Transputers. We obtain a speedup of 23.8 over Bitz and
Kung's  algorithm. 
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Abstract 

A new approach to finding a near-optimal collision-free
path is presented. The path planner is an implementation
of the adaptive error back-propagation algorithm which
learns to plan “good”, if not optimal, collision-free paths
from human-supervised training samples.

Path planning is formulated as a classification problem in
which class labels are uniquely mapped onto the set of
maneuverable actions of a robot or vehicle. A multi-scale
representational scheme maps physical problem domains
onto an arbitrarily chosen fixed size input layer of an
error back-propagation network. The mapping not only
reduces the size of the computation domain, but also
ensures applicability of a trained network over a wide
range of problem sizes. Parallel implementation of the
neural network path planner on hypercubes or Transputers
based on Parasoft EXPRESS is simple and efficient.
Simulation results of binary terrain navigation indicate
that the planner performs effectively in unknown
environments in the test cases.

Introduction 

Robots have been successfully employed in very restricted,
mechanical, and repetitive tasks, such as improving
productivity and quality in assembly lines in the
automotive industry. Although it is not likely that man
can construct even a near general-purpose robot in the
foreseeable future given the current level of technology
and advancement of science, the future generation of
task-specific robot systems is expected to be more
“autonomous” and “intelligent”. These future robot systems
would possess highly integrated capabilities of
task-specific sensing - to gather relevant information
about the environment and construct a limited world model
of the physical surroundings, goal-oriented planning - to
achieve a high-level specification of a goal by generating
a sequence of robot actions in advance, motor control - to
execute the planned sequence of actions step by step, and
learning - to gain domain-specific knowledge from
experience and respond to unknown environments
intelligently. In this paper, we confine our study to path
planning for robot navigation.

The objective of developing an autonomous robot navigation
controller is to enable a robot to guide itself from one
point in space to a destination without colliding with the
obstacles in its environment. The most basic form of a
motion planning problem is the generalized mover's
problem, which is also known as the Findpath or obstacle
avoidance problem [1]. The goal is to find any
collision-free path. For economic reasons, the path that a
robot tracks should obey some constraints, usually on time
and/or energy usage. For all practical purposes, the
notion of planning a “good” path is of prime importance to
any reasonable navigation controller.

There are many variants of the path planning problem. The
task of planning an optimal path is achievable only for
simple problems. Most of the time, the amount of
computation required to obtain such a path could be
costly. In many circumstances, optimal paths are not
required. It is often more important to obtain a “good”
(i.e., nearly but not precisely optimal) path quickly than
to devote precious computational resources to finding the
exact solution. In fact the input data is often imprecise
(e.g. the exact nature of the terrain is not known) and
the notion of a precise optimal path is undefined. Several
new optimization techniques such as simulated annealing,
neural networks, elastic networks and genetic algorithms
have been devised for such approximate optimization
problems [2-7].

The use of a multi-layer feedforward neural network for
path planning was first reported in [8], and some
preliminary results on the performance of such a trained
network were discussed in [9]. In this paper, we describe
in detail the implementation of the neural approach used
in [8,9] for the problem of planning a near-optimal,
collision-free path for a single mobile robot moving in
2-dimensional binary terrains. The computational
performance of the parallel algorithm on several
distributed-memory MIMD processors, namely the NCUBE-1,
the MEIKO Computing Surface (a Transputer-based system),
and the iPSC-2, is compared. We present algorithmic and
implementation performance for cases of robot navigation
on binary terrain.

Modeling  The  Navigation  Terrains 

We have chosen to apply our new approach to the simplest
non-trivial path planning problem, which is the navigation
of a single vehicle in a plane with binary terrain. The
terrain partitioning is a standard grid tessellation of
the physical space of the problem domain, which contains
regions of random or structured obstacles. The remaining
regions are robot-traversable space. The discretized
binary problem domain is represented as a 2-d matrix. A
measure of the size of an instance of the path planning
problem is the number of elements in the 2-d matrix. The
higher the resolution of discretization, the bigger the
problem size.

Although our technique can be extended to cover a more
general path planning problem, the problem statement of
our current study is as follows:

Given a discretized 2-d rectangular physical domain R, a
distribution of binary obstacles D over R, the robot's
current position S ∈ R − D, a high-level specification of
the target position T ∈ R − D − S, and a set of
maneuverable actions or motor control constraints C
governing the robot, find a near-optimal, collision-free
path for the robot to move around in R from S to T.


Supervised Neural Network Approach

Artificial neural networks have been employed in a variety
of applications, and were found most useful in the class
of applications which a human operator can handle easily
and efficiently. Obvious examples are in pattern
classification and speech recognition. Another example is
in playing chess. It has been a long-standing speculation
that a good chess player recognizes abstracted patterns of
the current board and commands a move with efficacy, while
a chess playing computer program has to evaluate and
search a huge game tree of legal moves and countermoves
rooted from the current board, iteratively to a fixed
depth.

In the same vein, the path planning problem can be
transformed to a pattern classification problem in which
class labels are uniquely mapped onto the set of
maneuverable actions C of a robot, while each time
instance of a scenario is mapped onto a 2-d binary
pattern. We used a multi-layer perceptron based on the
adaptive error back-propagation algorithm [10] as the
pattern classifier for the transformed path planning
problem.

The back-propagation model has been widely used for
pattern recognition tasks. The architecture of such a
neural net model has an input layer, an output layer, and
intermediate layers. All adjacent layers are fully
connected. Unlike the Hopfield model, which has recurrent
connections [4], the perceptron model does not provide a
feedback mechanism for neuronal activation to propagate. A
set of training sample pairs which carries some form of
relevant information about the classification problem at
hand is used to train the multi-layer network. Iterative
synaptic weight adaptation occurs following the
back-propagation algorithm as the error signal at the
output layer is propagated backward and filtered by the
same set of synaptic connections used for forward
propagation.


The Issue of Representation

The success of the back-propagation model on pattern
recognition problems relies heavily on the choice of the
pre-processing operations. The choice of pre-processing
operations for raw pattern data determines the selective
pruning and encoding of information. The representations
that emerge from these operations impose constraints on
subsequent processing by the back-propagation neural
model.

A reasonable pre-processing scheme should be one which
reduces and encodes raw patterns into some form of
standard representation. We adopted a non-linear,
multi-scale sampling strategy which mapped a physical
problem domain onto an arbitrarily chosen fixed size input
layer of an error back-propagation network. The
multi-scale representation of patterns is a natural
consequence of the sampling strategy used.

The sampling strategy involves using high resolution
neurons to encode terrain information close to a robot,
and progressively coarser neurons away from a robot by
increasing the sampling interval. The physical problem
domain is separated, in a fuzzy sense, into a near field
and a far field. Near field information, which is encoded
in high resolution neurons, is used to generate immediate
action (corresponding to local planning), while far field
information, which is encoded in the coarse neurons, is
used for global planning.

The only traffic regulation imposed on robot motion is
that every move must be collision-free. In this study, a
robot was restricted to move in one of the five admissible
directions in a 2-d rectangular problem domain, and the
cost to move in any of the five admissible directions is
the same. Our formulation of robot maneuverable motions
could be extended to eight directions to include the eight
nearest neighbors in the case of a 2-d grid tessellation.
Figure 1 shows the five admissible moves of a robot. Since
the cost of selecting to move in any of these five
directions is the same, an optimal collision-free path
would be one which minimizes the number of moves required
to get from a source point to a target position without
violating the traffic regulation.


Figure 1: Vehicle is constrained to move
in one of the five directions. All moves
have the same cost.

The five admissible moves are mapped one to one onto five
grandmother cells at the output layer of a
back-propagation model. The activation values of the five
grandmother cells indicate how good it is to move a robot
in each of the five corresponding directions (see Fig. 2).


[Figure 2 label: activation of output neuron indicates
scale of goodness]


Figure 2: A robot moves in the direction
which corresponds to the highest activation
voltage.
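The selection rule in Fig. 2 amounts to taking the output cell with the largest activation. The sketch below illustrates that rule only; the direction labels in `MOVES` and the function name `select_move` are hypothetical, since the text does not name the five admissible directions.

```python
# Hypothetical labels for the five admissible moves; the paper shows them
# only in Figure 1, so this ordering is an assumption for illustration.
MOVES = ("W", "NW", "N", "NE", "E")


def select_move(activations, moves=MOVES):
    """Return the move whose output-cell activation is highest."""
    if len(activations) != len(moves):
        raise ValueError("one activation per admissible move is required")
    best = max(range(len(moves)), key=lambda j: activations[j])
    return moves[best]


chosen = select_move([0.10, 0.20, 0.90, 0.30, 0.05])
```

Ties are resolved in favor of the first maximal cell, which is a choice of this sketch rather than something the text specifies.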

We have arbitrarily chosen to use four different scales
and nine general directions (East, NEE, NE, NNE, North,
NNW, NW, NWW, West) to represent each scenario. This
sampling strategy divides the problem domain into 36
regions. At the input layer, one neuron is needed to
encode information for each region, which is one
combination of direction and scale. Altogether, 36 neurons
are needed. An intermediate layer with 20 neurons was
used. The number of neurons in the intermediate layer was
arbitrarily chosen to achieve a fan-in architecture.

The activation value for each input neuron is a function
of a porosity index which is a measure of the
traversability of the corresponding region. The porosity
index is taken as the complement of the density of
obstacles. The input neuron activation is computed by
using Eq. (1)

$$ a_{\eta_i} = f(1 - \rho_{\gamma_i}) + \delta(\mathrm{target}, \gamma_i) \tag{1} $$

where $a_{\eta_i} \in [-1.0, 1.0]$ is the activation value
for neuron $\eta_i$ in the region $\gamma_i$,
$\rho_{\gamma_i}$ is the density of obstacles in
$\gamma_i$, and $\delta$ is the Kronecker delta, which
equals 1 if the target position is in $\gamma_i$ and zero
otherwise.
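As a rough illustration of Eq. (1), the sketch below computes the porosity of one sampled region from its obstacle cells and adds the target indicator. The function name `input_activation`, the 0/1 list-of-lists encoding of a region, and the identity default for `f` are assumptions; the text does not specify the form of `f`.

```python
def input_activation(region, target_in_region, f=lambda porosity: porosity):
    """Activation of one input neuron from the cells of its region.

    region: 2-d list of 0/1 cells, where 1 marks an obstacle (assumed encoding).
    f: mapping applied to the porosity; identity is only a placeholder here.
    """
    cells = [cell for row in region for cell in row]
    density = sum(cells) / len(cells)          # fraction of obstacle cells
    porosity = 1.0 - density                   # complement of the density
    delta = 1.0 if target_in_region else 0.0   # Kronecker delta term
    return f(porosity) + delta


region = [[0, 1],
          [0, 0]]
plain = input_activation(region, target_in_region=False)
with_target = input_activation(region, target_in_region=True)
```

For this 2 x 2 region with one obstacle, the density is 0.25 and the porosity 0.75, so the target term simply shifts the value by one.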

Parallel  Back-Propagation 

The back-propagation algorithm is an effective training
algorithm for the feed-forward multi-layer perceptron
model. It is a generalization of the least mean square
algorithm or the delta rule. Back-propagation uses a
gradient descent technique to minimize a quadratic error
function which is defined as the mean square difference
between each actual network output vector and its
associated target vector for the set of training samples.

Let  us  define  the  following: 

• $w_{ij}$ is the connection weight between the $j$-th
neuron in the current layer and the $i$-th neuron
in the immediate lower layer,

• $\theta_j$ is the internal threshold of the $j$-th neuron
in the current layer,

• $x_i$ is the $i$-th continuous-valued input from the
input layer or the layer underneath the current
layer,

• $x'_j$ is the output of the $j$-th neuron in the current
layer,

• $y_j$ is the actual model output of the $j$-th neuron
in the output layer,

• and $d_j$ is the desired or target output of the $j$-th
neuron in the output layer.

The model is trained by initially assigning small random
weights to the synaptic connections and small random
thresholds to the artificial neurons. The neuronal outputs
from each layer are then computed by

$$ x'_j = f\left( \sum_{i=1}^{N} w_{ij} x_i - \theta_j \right) \tag{2} $$

where $N$ is the number of neurons in the layer below the
current layer. If the current layer is the physical output
layer,

$$ y_j = x'_j. \tag{3} $$

The neuron activation function $f$ has to be
non-decreasing and continuously differentiable. Usually,
the sigmoid logistic function

$$ f(u) = \frac{1}{1 + e^{-u}} \tag{4} $$

is used for this purpose.
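The layer-by-layer computation of Eqs. (2) to (4) can be sketched as follows for the 36-20-5 architecture described earlier. This is a minimal illustration, not the authors' code: the helper names, the initialization range `scale`, and the convention that the threshold is subtracted from the weighted sum are assumptions.

```python
import math
import random


def sigmoid(u):
    """Logistic activation function of Eq. (4)."""
    return 1.0 / (1.0 + math.exp(-u))


def layer_output(x, w, theta):
    """Outputs of one layer; w[i][j] links lower neuron i to current neuron j."""
    return [
        sigmoid(sum(w[i][j] * x[i] for i in range(len(x))) - theta[j])
        for j in range(len(theta))
    ]


def init_layer(n_in, n_out, rng, scale=0.1):
    """Small random weights and thresholds (the range is an assumed choice)."""
    w = [[rng.uniform(-scale, scale) for _ in range(n_out)] for _ in range(n_in)]
    theta = [rng.uniform(-scale, scale) for _ in range(n_out)]
    return w, theta


def forward(x, layers):
    """Propagate an input pattern through all layers in order."""
    for w, theta in layers:
        x = layer_output(x, w, theta)
    return x


rng = random.Random(0)
layers = [init_layer(36, 20, rng), init_layer(20, 5, rng)]
y = forward([0.5] * 36, layers)  # five output activations, one per move
```

With 36 inputs, 20 intermediate neurons, and 5 outputs, the network holds 36 x 20 + 20 x 5 = 820 weights and 25 thresholds.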

Given the measure of the error on any pattern in the
training set as

$$ E = \sum_j (d_j - y_j)^2 \tag{5} $$

and the neuron activation function $f$ as described in
Eq. (4), adaptive correction of the connection weights in
the direction of $-\partial E / \partial w$ corresponds to
performing a steepest descent search in the weight space
to minimize error. Synaptic weight adaptation follows
Eq. (6)

$$ \Delta w_{ij}(t+1) = \eta\, \delta_j\, x_i + \alpha\, \Delta w_{ij}(t) \tag{6} $$

where $\eta$ is the learning rate, $\alpha$ is the
momentum term which determines how much is remembered
about the previous iteration, and $\delta$ is the filtered
error signal.
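A single application of the momentum update in Eq. (6) might look like the sketch below. The function name `momentum_step` and the nested-list weight layout are assumptions made for illustration; the default values 0.10 and 0.80 are the learning rate and momentum reported for Fig. 3.

```python
def momentum_step(w, dw_prev, delta, x, eta=0.10, alpha=0.80):
    """Apply one weight update with momentum.

    w[i][j]       current weight from lower neuron i to neuron j
    dw_prev[i][j] weight change applied in the previous iteration
    delta[j]      filtered error signal of neuron j
    x[i]          input arriving from lower neuron i
    """
    n_in, n_out = len(x), len(delta)
    dw = [
        [eta * delta[j] * x[i] + alpha * dw_prev[i][j] for j in range(n_out)]
        for i in range(n_in)
    ]
    w_new = [[w[i][j] + dw[i][j] for j in range(n_out)] for i in range(n_in)]
    return w_new, dw


w_new, dw = momentum_step(w=[[0.0]], dw_prev=[[0.5]], delta=[1.0], x=[2.0])
```

In this one-weight case the new change is 0.10 x 1.0 x 2.0 + 0.80 x 0.5 = 0.6, so most of the step comes from the remembered previous change.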

Similarly, the internal thresholds $\theta_j$ are
corrected adaptively in the threshold space. The partial
derivatives $-\partial E / \partial w$ and
$-\partial E / \partial \theta$ are computed by
propagating error signals from the output layer back to
the lower layers through the net, which motivates the name
“back-propagation”.

Parallel implementation of the back-propagation model for
Chinese character recognition on hypercubes, based on a
character decomposition technique using bitmap masks, has
been discussed in [11]. Our current implementation of the
path planner is based on a distribution of the set of
training patterns over the processors of a hypercube.
Essentially, each processor of an allocated hypercube or
Transputer-based concurrent processor is responsible for
only a small subset of the set of training samples. Using
Parasoft EXPRESS as the communication software, the same
code runs on the NCUBE-1, the iPSC-2, and the Meiko
Computing Surface, which is a Transputer-based system.
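The pattern distribution described above can be sketched without any message-passing library. The block partition and the element-wise summation of per-processor contributions below are assumptions about how such a scheme is commonly organized, not a description of the EXPRESS code; `partition_patterns` and `combine_partials` are hypothetical names.

```python
def partition_patterns(n_patterns, n_procs):
    """Split pattern indices into contiguous blocks, one per processor."""
    base, extra = divmod(n_patterns, n_procs)
    blocks, start = [], 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)  # first `extra` blocks get one more
        blocks.append(range(start, start + size))
        start += size
    return blocks


def combine_partials(partials):
    """Element-wise sum of the per-processor partial weight changes."""
    return [sum(values) for values in zip(*partials)]


blocks = partition_patterns(184, 64)
largest = max(len(b) for b in blocks)
total = combine_partials([[1.0, 2.0], [3.0, 4.0]])
```

For the 184-pattern training set on 64 processors, 56 blocks hold three patterns and 8 hold two, so no processor handles more than three.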

The training set we used consists of 184 patterns, which
contain knowledge of 184 scenarios of human-supervised
optimal and collision-free moves in binary terrain
navigation. The choice of using 184 patterns is arbitrary.
Our first goal is to teach the navigator enough basic
knowledge upon which it can generalize, not just memorize,
to cope with most situations.

Learning Histories of A
Back-Propagation Path Planner

Besides the issue of pattern representation, parameter
tuning is a major concern for the back-propagation model
to converge fast, or converge at all. Convergence of the
model depends on the initial configuration of the network,
the choice of the learning rate $\eta$, and the momentum
term $\alpha$.

The learning histories for four different initial
configurations with fixed learning rate $\eta = 0.10$ and
momentum $\alpha = 0.80$ are shown in Fig. 3. Average
error is defined as the average of the total quadratic
error per pattern per output neuron. The back-propagation
path planner converged for these four cases.

Figure 3: The learning histories of a back-
propagation path planner for four different
initial configurations.


To study the effect of the momentum term $\alpha$ on the
convergence of the model, we used a fixed initial
configuration and set $\eta = 0.10$. Figure 4 shows the
learning histories for $\alpha$ = 0.2, 0.4, 0.6, and 0.8.
The behavior of the model in its learning phase was very
similar for the four different values of $\alpha$. All
four cases converged roughly at the same rate because they
had the same learning rate.

More dramatic effects were observed, as expected, for the
cases of using different learning rates $\eta$, and fixing
the initial configuration and the value of $\alpha$. The
learning histories for seven different learning rates are
displayed in Fig. 5.


Figure 4: The learning histories of a back-
propagation path planner using four different
values for the momentum term.

In general, the bigger the learning rate the faster the
convergence. However, when the learning rate is set to a
value that is “too big”, it leads to big oscillations and
instability. Small learning rates usually lead to smooth
but slow learning.

Figure 5: The learning histories of a back-
propagation path planner for seven different
learning rates.
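The qualitative effect of the learning rate can be reproduced on a toy problem. The sketch below runs plain gradient descent on the one-dimensional quadratic error E(w) = c w^2 / 2, which is an assumed stand-in and not the path planner's error surface; `descend` and its parameters are hypothetical names. For this toy error each step multiplies w by (1 - eta c), so the iteration converges only when eta < 2 / c.

```python
def descend(eta, curvature=1.0, w0=1.0, steps=20):
    """Gradient descent on E(w) = curvature * w**2 / 2; returns the iterates."""
    w = w0
    history = [w]
    for _ in range(steps):
        w -= eta * curvature * w  # dE/dw = curvature * w
        history.append(w)
    return history


slow = descend(0.1)         # smooth but slow decay toward the minimum
oscillating = descend(1.9)  # sign flips every step, still shrinking
unstable = descend(2.5)     # sign flips and grows without bound
```

The three runs correspond to the regimes described in the text: small rates give smooth but slow learning, larger rates introduce oscillation, and rates that are too large destroy convergence.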


Performance  of  A  Trained 
Back-Propagation  Path  Planner 


We tested the performance of a trained back-propagation
neural path planner by submitting to it unlearned
scenarios in the form of 47 x 24 as well as 105 x 53
discretized maps. Several instances which include random
and structured obstacles are shown in Figs. 6 to 10.


Figure 6: A 47 x 24 binary terrain with
randomly distributed obstacles. A triangle
indicates the starting location, and an
inverted triangle indicates the target position.
An optimal collision-free path was planned.


Figure 7: A 47 x 24 binary terrain with
randomly distributed obstacles. An optimal
collision-free path was planned.

Since an optimal path in this study is one which minimizes
the number of collision-free moves needed to get from a
source to a target position within the problem domain, the
paths displayed in Figs. 6, 7, 9, and 10 are optimal.
However, the planned path is near-optimal in Fig. 8.


Performance  of  Parallel  Implementation 

We used up to 64-node NCUBE-1, iPSC-2, and 16-node Meiko
systems for our simulations. A training set of 184
patterns becomes a small problem for the case of 64
processors because each processor is then responsible for
performing computations for at most 3 patterns. Figure 11
shows the timing result for running one iteration of the
back-propagation path planner. The reported time is
normalized to that needed to run one iteration of the same
planner on a 20 MHz SUN4/60 SPARCstation 1. For the
one-processor case, a Meiko Transputer node was the
fastest, achieving a performance close to that of a
SUN4/60. The efficiencies of the same program on the three
different concurrent processors are shown in Fig. 12.
Although the efficiency for the simulations performed on
an NCUBE-1 seems to be better than on an iPSC-2 or a Meiko
Transputer system, this result should be taken with care.
Even though exactly the same EXPRESS program was used for
simulations, there were differences in the hard-wired
configuration of the three computer systems, and in the
implementation of EXPRESS.

Figure 8: A 47 x 24 binary terrain with
structured obstacles. The planned path is
near-optimal.

There is one simple explanation for the poor efficiency of
the simulations on a Meiko Computing Surface. EXPRESS
communications are best suited for concurrent processors
with hypercube connectivity. At the time of the
simulations, the Transputer system was hard-wired as a 2-d
torus instead of a hypercube, resulting in poor
performance.


Figure 9: A 105 x 54 binary terrain with randomly distributed obstacles. An optimal collision-free
path was planned.


Figure 10: A 105 x 54 binary terrain with structured obstacles. An optimal collision-free path was
planned.


[Figure 11: normalized time; Figure 12: efficiency (%)]


EXPRESS was optimally implemented on the NCUBE-1. As for
the iPSC-2, EXPRESS was implemented on top of the native
operating system (NX). This extra layer in between EXPRESS
and the hardware is expected to incur inefficiency.

Conclusions

Simulation results indicate that a trained
back-propagation path planner possesses satisfactory
capability of planning near-optimal, collision-free paths
in binary terrains with random or structured obstacles.
The multi-scale mapping scheme not only reduces the size
of the computational domain and encodes sufficient
information to carry out the planning task, but also
ensures applicability of the trained network on a wide
range of problem sizes.

The advantages of this new approach of transforming a path
planning problem to one in pattern classification by
neural networks are:

• External homing strategy is not required.

• No explicit heuristic is used for shortest path.

• No need to decompose the problem domain into
configuration space and free space.

The homing strategy and the notions of optimality and
obstacle avoidance are all encapsulated into the training
patterns as task-specific knowledge from a human teacher.

Abstract

A two vehicle navigator on a discrete space is analyzed.
The concept of linking time maps as a source for optimal
path planning is discussed. The rules for constructing
these maps are given in a cellular automata mode. The
implementation of these rules on a parallel computer is
presented.

1.  Introduction. 

In this study navigation means determination of a path on
a navigation surface [NS] from an origin point to a
destination point. A cost function defined on the NS
measures the cost of traveling a length segment. The cost
can be time, length, hazard of traveling the segment, etc.
An optimal path on the NS is a path along which the
integral of the cost function from origin to destination
is minimum. The objective of a navigator is to find the
optimal path under the constraints set by the NS. The
problem of an optimal path for a single vehicle has been
solved on a continuous surface [1] as well as on discrete
surfaces [2]. This study analyses the two vehicle
navigator and presents the linking time maps as a tool to
deal with these problems.

A discrete solution for navigation on a continuous space
requires mapping of the space into a finite graph. This is
done by choosing a finite number of points $\{v_i\}$ on
the surface as the nodes of the graph. Each node is
connected by an edge to all the nodes which can be reached
without traversing another node. The set of all the nodes
$\{v_j\}$ having a common edge with $v_i$ is the set of
nearest neighbors of $v_i$ [nn(i)]. The value $w_{ij}$ of
the cost of traveling along the directed edge
$[v_i, v_j]$ is assigned to this directed edge. This
procedure maps the surface onto a directed graph (Fig. 1).
Mapping the NS onto a directed graph transfers the search
for an optimal path on the NS to the search for an optimal
path on a directed graph. This search is solved by a
dynamic programming approach [3], where a “signal” is
initialized at a source point and propagates from a node
to all its nn along the edges joining them. The time the
signal travels along an edge is the weight of the directed
edge. Every node $v_i$ records the first time $t_i$ it was
hit by the signal. The graph in which all the nodes $v_i$
have their correct time values $t_i$ is called the linking
time map [LTM] with respect to the generating node.

In fact the linking time $t_i$ at the node $v_i$ is the
cost of an optimal path from the source to this node, and
it depends only on the weights of the edges and the
generating node. The linking times $t_i$ and $t_j$ of two
sequential nodes $v_i$ and $v_j$ on an optimal path, where
$v_i$ precedes $v_j$, are related by:

$$ t_j = t_i + w_{ij} \tag{*} $$

An optimal path from the origin to any point on the graph
is traced from that point back to the origin. Every step
is from a node $v_j(t_j)$ to a node $v_i(t_i)$ where $t_i$
and $t_j$ satisfy (*). Tracing back ensures that one stays
on an optimal path initialized at the origin.
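The construction of a linking time map and the trace-back rule can be sketched as follows. The sketch uses Dijkstra's algorithm as one standard sequential way to realize the signal propagation; it is not the cellular automata scheme of this paper. The dictionary graph encoding, the function names, and the example weights are assumptions, and all weights are assumed positive.

```python
import heapq


def linking_time_map(graph, source):
    """First-arrival time t[v] of a signal started at `source`.

    graph: dict mapping node -> dict of neighbor -> weight (directed edges).
    """
    t = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        ti, vi = heapq.heappop(heap)
        if ti > t.get(vi, float("inf")):
            continue  # stale heap entry; a shorter time was already recorded
        for vj, wij in graph.get(vi, {}).items():
            tj = ti + wij
            if tj < t.get(vj, float("inf")):
                t[vj] = tj
                heapq.heappush(heap, (tj, vj))
    return t


def trace_back(graph, t, source, target, tol=1e-9):
    """Walk from target to source through nodes satisfying t_j = t_i + w_ij."""
    path = [target]
    vj = target
    while vj != source:
        for vi, edges in graph.items():
            if vj in edges and vi in t and abs(t[vi] + edges[vj] - t[vj]) <= tol:
                path.append(vi)
                vj = vi
                break
        else:
            raise ValueError("no predecessor satisfies t_j = t_i + w_ij")
    path.reverse()
    return path


graph = {"A": {"B": 1.0, "C": 4.0}, "B": {"C": 2.0, "D": 5.0},
         "C": {"D": 1.0}, "D": {}}
ltm = linking_time_map(graph, "A")
path = trace_back(graph, ltm, "A", "D")
```

Because every weight is positive, each trace-back step moves to a node with a strictly smaller linking time, so the walk always terminates at the source.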

Let us call the traveling object a vehicle and consider
the case of two vehicles traveling on the same NS. If the
path of each vehicle introduces restrictions to the path
of the other vehicle (e.g. collision avoidance) then a
search for an optimal solution is much more complicated.

The layout of this study is as follows. In section 2 we
discuss navigation of an autonomous vehicle on an NS which
is updated while traveling. In section 3 we introduce time
and deal with conflicts between vehicles. The resolution
of a conflict by imposing a delay on a vehicle is
discussed and the paths solving a two vehicle navigator
are analysed. Section 4 briefly outlines the algorithm for
the two vehicle navigator. Section 5 presents the cellular
automata rules for constructing linking maps. Section 6
deals with the actual parallel implementations of the
construction of linking maps, and section 7 presents the
simulation results.


2. Autonomous vehicle in uncertain environment

An autonomous vehicle in an uncertain environment starts
with an estimate of the edge weights. The estimate
reflects the prior knowledge or model it has for the
terrain to be traversed. The estimate is improved as more
information is obtained. The vehicle knows its position
and destination, and at each instant of time the vehicle
does the following: 1) updates the database of the weights
$\{w_{ij}\}$; 2) based on the updated data, determines the
optimal path from its current position to the destination;
3) moves on the chosen optimal path; 4) collects data.


Updated weights change the LTM, but a change in the
linking time of a node may affect the linking times of
only part of the other nodes. In section 5 we show how to
update the LTM in a cellular automata fashion, based on
local decisions of each node.

The navigator for an autonomous vehicle is based on the
reversed linking time map [RLTM]. The construction of the
RLTM is similar to the construction of the LTM, except
that in constructing the RLTM the signal is initialized at
the destination point and propagates from $v_i$ to $v_j$
with traveling time $w_{ji}$. The path is traced from the
vehicle position toward the destination, from $v_i$ with
reversed linking time $\theta_i$ to its nearest neighbour
$v_j$ with reversed linking time $\theta_j$ which
satisfies:

$$ \theta_j = \theta_i - w_{ij} $$

Whenever the vehicle gets new information it updates the
$\{w_{ij}\}$ database and its RLTM, and determines an
optimal path (Fig. 2).
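The reversed map and the forward tracing rule can be sketched in the same style. Again Dijkstra's algorithm stands in for the signal propagation, and the update step simply recomputes the whole RLTM rather than applying the local cellular automata update discussed in section 5. The graph encoding, function names, and weights are assumptions, with all weights taken to be positive.

```python
import heapq


def reversed_linking_time_map(graph, destination):
    """theta[v]: cost of an optimal path from v to `destination`."""
    rev = {}  # rev[vj][vi] = w_ij, so the signal can run against the edges
    for vi, edges in graph.items():
        for vj, w in edges.items():
            rev.setdefault(vj, {})[vi] = w
    theta = {destination: 0.0}
    heap = [(0.0, destination)]
    while heap:
        th, v = heapq.heappop(heap)
        if th > theta.get(v, float("inf")):
            continue  # stale heap entry
        for u, w in rev.get(v, {}).items():
            if th + w < theta.get(u, float("inf")):
                theta[u] = th + w
                heapq.heappush(heap, (th + w, u))
    return theta


def trace_forward(graph, theta, position, destination, tol=1e-9):
    """Step to a neighbor v_j with theta_j = theta_i - w_ij until arrival."""
    path = [position]
    vi = position
    while vi != destination:
        for vj, wij in graph[vi].items():
            if vj in theta and abs(theta[vi] - wij - theta[vj]) <= tol:
                path.append(vj)
                vi = vj
                break
        else:
            raise ValueError("no neighbor satisfies theta_j = theta_i - w_ij")
    return path


graph = {"A": {"B": 1.0, "C": 4.0}, "B": {"C": 2.0, "D": 5.0},
         "C": {"D": 1.0}, "D": {}}
theta = reversed_linking_time_map(graph, "D")
first_plan = trace_forward(graph, theta, "A", "D")

graph["B"]["C"] = 10.0  # new information: edge B -> C is costlier than assumed
theta = reversed_linking_time_map(graph, "D")
second_plan = trace_forward(graph, theta, "A", "D")
```

After the weight of the edge from B to C is raised, the route through B is no longer optimal and the vehicle replans directly through C.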


3. Navigation in Space-Time, and non-conflicting
paths for two vehicles.

Assume that the cost function is time, i.e. the weights
$\{w_{ij}\}$ are the times of travel along the
corresponding edges. Then a navigator for two vehicles
aims to find two paths, one for each vehicle, which yield
the minimum time of travel. Assuming the two vehicles
start at the same time, the time of travel is the time it
takes until both of them have arrived. This optimum is
restricted to non-conflicting paths.


A conflict between two paths occurs when the two vehicles
are at the same site at the same time. The set of points
on the graph edges is partitioned into sites as follows.
Each point is associated with the nearest of the two nodes
terminating the edge. A conflict can be either a) a
conflict inside a site or b) a swap conflict on the
boundary between two sites. In the second case the
vehicles are going in opposite directions.
Let $v_i, v_j, v_l$ and $v_{i'}, v_j, v_{l'}$ be three
sequential nodes on the paths of vehicle 0 and vehicle 1
respectively. The node $v_j$ is on the two paths. A
conflict of type (a) at $v_j$ occurs if and only if

A conflict of type (b) at the boundary between $v_j$ and
$v_l$ occurs if and only if:


i'=  Ir 


To resolve the conflict at $v_j$, one vehicle cannot enter the site until the other clears the site of $v_j$. In the graph representation this is done by imposing a delay $w$ at $v_j$ on either one of the two vehicles:


u;  =  <!./  — 


v’jk' 


n  Wti 
—  4 _ 
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on  vehicle  0,  or 


w  =  tt  — 
on vehicle 1. Imposing a delay $w$ at $v_j$ on vehicle $k$ means that $t_j^k$ is set to $t_j^k + w$ and LTM$^k$ is updated accordingly. Imposing a delay on a vehicle and updating its LTM preserves the property of the LTM that the tracing-back procedure yields the optimal paths under the imposed restriction.


If the optimal paths of vehicles 0 and 1 have more than one conflicting node, then: 1) their path segments from the first to the last conflict have exactly the same time of travel; 2) on these equivalent segments they are traveling in the same direction. When the two paths have more than one conflict, the resolution of each conflict requires the minimal delay given above. Therefore, imposing the maximal of these delays at the first node of conflict resolves all the conflicts between these two paths. However, the path with the delay on it may no longer be an optimal path.

Consider the case where an optimal path of one vehicle conflicts, at $v_i$, with the optimal path of the other vehicle. Assume that the required delay at $v_i$ was imposed on one of the vehicles, its LTM was updated and a new optimal path was traced. Then one of the following will occur:

1. The new path does not conflict with the path of the other vehicle, and the two paths are candidates for an optimal solution.

2. The new path conflicts with the path of the other vehicle, but it does not pass through $v_i$.

3. The new path passes through $v_i$ and it conflicts with the path of the other vehicle. In this case the new conflict is a swap conflict at the boundary between $v_j$ and its proceeding node on the other vehicle's path.

In an optimal solution of the two vehicle navigator there cannot be an instant when both vehicles are waiting. Therefore, the paths solving this problem can be of three types:

1. Neither of the vehicles waits.

2. One of the vehicles waits.

3. The two vehicles have to wait. The last case happens when resolving a swap conflict, when vehicle $k$ has to wait for vehicle $l$ to step aside, letting $k$ pass and then looping or detouring.

Let us extend the NS by adding to it the time dimension (Fig. 3). The graph $\{v_i, e_{ij}(w_{ij})\}$ on the navigation plane is the projection of the extended graph on the $t = 0$ plane. The linking time value $t_i$ of a node $v_i$ is its t-coordinate in the extended space. The linking times $t_i$ and $t_j$ of two sequential nodes, $v_i$ and $v_j$, on a legal path in the extended space are restricted to the condition (1). In the extended graph a delay means that two sequential nodes on a path have the same NS coordinates but different time coordinates. A loop means that two non-sequential nodes on the path have the same NS coordinates but different time coordinates. A detour means that two sequential nodes on the path do not obey the path rule, i.e. $\theta_i + w_{ji} > \theta_j$.

The space-time representation of the paths depicts the difference between this problem and the K-disjoint [4] problem. In this problem we do not know the t-coordinates of the destination points. These points are subjected to the searching process.

The complexity of a search for an optimal solution for multiple vehicles grows rapidly with the number of vehicles. For this reason, other suboptimal methods are investigated, such as neural networks [5,6].


4. Algorithm for the two vehicle navigator.

The algorithm for the two vehicle navigator is based on the concepts discussed in the previous section, using the cellular automata rules of the next section. The idea is to hold an LTM and an RLTM for each vehicle and to update them whenever a restriction is set. The need for an RLTM arises whenever a swap conflict occurs and a search for a loop or a detour is required.

As was already stated, the two vehicles cannot wait at the same time, and a solution which imposes delays on the two vehicles is obtained only when one of the paths is a loop or a detour. Therefore, the algorithm finds two separate solutions: one where the delays are imposed on vehicle 1 only, and one where the delays are imposed on vehicle 0 only. When imposing a delay to resolve a swap conflict the algorithm checks for a loop or a detour. The better of these solutions is the optimal solution. In practice the algorithm will not construct both solutions; to minimize computation, it prunes the search by always adjusting the path of the vehicle with the shorter time.

On a binary speed NS the speed of the vehicle at each point is either 1 or 0. The two vehicle navigation problem on this NS is much easier, as the rules take a simpler form: a conflict of type (a) is at the node itself and needs $w = 1$ to be resolved, while the swap conflict (of type (b)) needs $w = 2$. Fig. 4 presents the two vehicle navigator solution for a conflict-imposing NS.


5. Cellular automata instructions for the navigation algorithm

Rule 1: The linking time of the generating point is always $t_0 = 0$.

Rule 2: The linking time $t_i$ of every node $v_i$, $i \neq 0$, is:

$$t_i = \mathrm{Min}\{t_i,\ t_j + w_{ji} \mid v_j \in nn(i)\}$$

where $nn(i)$ are all the nearest neighbors of $v_i$.

Rule 2': The reversed linking time $\theta_i$ of every node $v_i$, $i \neq 0$, is:

$$\theta_i = \mathrm{Min}\{\theta_i,\ \theta_j + w_{ij} \mid v_j \in nn(i)\}$$

Rule 3: If a node other than the generator does not have a source, it sets its linking time to infinity. Namely, if $i \neq 0$ and $t_i > t_j\ \forall v_j \in nn(i)$, then $t_i = \infty$.

Rule 3': If $i \neq 0$ and $\theta_i > \theta_j\ \forall v_j \in nn(i)$, then $\theta_i = \infty$.

Algorithm  for  constructing  the  LTM  or 
RLTM: 

1. Initialize the linking times of all the nodes to "infinity".

2. Set the generator linking time to 0.

3. Apply rule 2 or 2'.

4. When no node updates its value any more, the LTM or RLTM is done.
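The construction steps above can be sketched as follows. This is an illustrative sequential version only: the paper applies Rule 2 asynchronously at every node, whereas the sketch sweeps over all edges repeatedly (a Bellman-Ford-style relaxation) until no linking time changes. The function name `build_ltm`, the dictionary layout `w[j][i]` for the travel time from node j to node i, and the toy graph are assumptions made for the example; non-negative weights are assumed.

```python
import math


def build_ltm(w, generator):
    """Return dict node -> linking time, using t_i = Min{t_i, t_j + w_ji}."""
    nodes = set(w)
    for targets in w.values():
        nodes.update(targets)

    t = {v: math.inf for v in nodes}   # step 1: all linking times "infinity"
    t[generator] = 0.0                 # step 2 (Rule 1): generator at 0

    changed = True
    while changed:                     # step 4: stop when nothing updates
        changed = False
        for j, targets in w.items():
            for i, w_ji in targets.items():
                # step 3 (Rule 2), never applied to the generator itself
                if i != generator and t[j] + w_ji < t[i]:
                    t[i] = t[j] + w_ji
                    changed = True
    return t


# Hypothetical three-node graph; "A" is the generating point.
w = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"A": 1.0, "C": 2.0},
    "C": {"A": 4.0, "B": 2.0},
}
ltm = build_ltm(w, "A")
```

For an RLTM the same loop applies with the destination as generator and the weights read in the reverse direction, as in Rule 2'.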

Algorithm for updating the LTM or RLTM where a delay $W$ is imposed on $v_i$:

1. Set $t_i = t_i + W$ / $\theta_i = \theta_i + W$.

2. Apply rule 3 / 3'.

3. Apply rule 2 / 2'.

6. Parallel implementation of the time-linking map

The cellular automata mode of constructing the LTM is asynchronous, but the linking process has a propagating nature. The wave front of the propagating linking signal depends on the data and on the location of the generating node. Therefore, the scattered decomposition [7] would be the most appropriate decomposition approach. The mapping in this approach is as follows. The NS is tessellated into $N_x \times N_y$ congruent templates. Each template is tessellated again into $K$ equal tiles, where $K$ is the number of processors. Each processor is assigned the same tile of the template over all the templates (Fig. 5). As the computational graph in our case is very irregular and time dependent, the scattered decomposition will hopefully balance the work done in each processor.
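A minimal sketch of this mapping is given below. It assumes square tiles of side `tile` nodes and a template laid out as `px` by `py` tiles (so K = px * py); these parameter names and the row-major numbering of processors are illustrative assumptions, not details given in the paper.

```python
def scattered_owner(x, y, tile, px, py):
    """Processor that owns grid node (x, y) under a scattered decomposition.

    The template of px * py tiles is repeated over the whole grid, and a
    processor owns the same tile position in every copy of the template.
    """
    tile_x = (x // tile) % px   # tile column inside its template
    tile_y = (y // tile) % py   # tile row inside its template
    return tile_y * px + tile_x


# Example: 145 x 145 nodes, 19 x 19 tiles, 4 processors in a 2 x 2 template.
counts = [0, 0, 0, 0]
for gx in range(145):
    for gy in range(145):
        counts[scattered_owner(gx, gy, 19, 2, 2)] += 1
```

Since each processor holds tiles spread over the whole surface, a wave front passing through any region touches tiles of every processor, which is the load-balancing effect the text relies on.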

As the information propagates from a node to its neighbors, the smaller the tiles in each template are, the greater is the number of nodes propagating the correct linking time. On the other hand, the smaller the tile is, the greater is the number of nodes on the boundary. Therefore, for a given number of processors and dimension of the discrete NS there is an optimal tile size. The larger the number of processors, the smaller the optimal tile size.

In planning the broadcast of the information one has to decide how many informing nodes to accumulate before transmitting the new data. On the one hand, accumulating the information saves transmission time. On the other hand, getting the information as soon as possible saves updates and enables more templates to participate in the propagation process. In our simulation, we adopt the strategy of broadcasting the new information to the neighboring processor whenever a node on the boundary is updated. When the communication overhead is not too large, as in the case of the Meiko transputer board, the number of updates is kept to a minimum since the information delay is very small.

7.  Simulation  results 

Extensive simulations have been carried out on an NS which was tessellated into 145 by 145 nodes. Fig. 6 is a plot of the speedup versus number of processors under different tile sizes. This plot shows that 4 processors are the natural choice for the two-dimensional NS. For a given patch size, the speedup decreases with the number of processors. This is expected because of the propagating nature of the problem at hand. The simulation shows, as depicted in Fig. 6, the optimal tile sizes for the example simulation. These sizes are: 19x19 for 4 processors, 13x13 for 8 processors, and approximately 9x9 for 16 processors. This shows the general trend that increasing the number of processors decreases the optimal size of the tile.
Figure 1. Mapping of a terrain onto a graph.


* On a leave of absence from NRCN Israel.

Figure 2. Autonomous vehicle in uncertain environment. The gray level of an area is proportional to its cost. The white lines are the equi-cost contours. After a short travel along the optimal path (a) the vehicle updated its data and determined a new optimal path (b).

Figure 3. Nodes, terrain's directional values (gray level arrows) and a path in the Cost-Terrain space.

Figure 4. The two-vehicle navigator solution for a conflict imposing terrain and a path in the Cost-Terrain space.

Figure 5. Scattered decomposition: the basic template of 4 processors is repeated over the terrain.

Figure 6. Speedup for decomposition scheme for different block sizes (4x4, 8x8, 12x12, 16x16) on 16-node Meiko Computing Surface.
Abstract

We develop a neural network formulation for multi-vehicle navigation on a two-dimensional surface. A time-linking map is generated for each individual vehicle using techniques similar to the known shortest path algorithms for an isolated vehicle. Neural networks are then applied to generate non-conflicting paths minimizing the time of travel.

1.  Introduction 


This paper presents a neural network approach to the multi-vehicle navigation problem. Here we use the term vehicle to refer to a point which travels on a surface of navigation (NS). Navigation as presented here refers to the determination of a path in the space-cost (time) domain from an origin to a destination point. The surface of navigation usually has a terrain with position-dependent velocities and/or hazards which the vehicle has to consider. The navigator searches for an optimal path on this surface. Optimum here may be with regard to minimal length, minimal time, minimal hazards, etc. Each of these parameters, when minimized, acts as the cost parameter. To each element of area $dl \times dl$ of the NS is associated the value $dt$ of the cost of traveling the segment length $dl$ on this area. An optimal path between source and destination is the one which yields

$$\min \int_{\mathrm{source}}^{\mathrm{destination}} dt.$$

Navigation problems for one vehicle on a continuous surface as well as on a discrete grid have already been studied and solved''^. In our paper we consider navigation of more than one vehicle in a two-dimensional space, where each vehicle has its own origin and destination. The objective is to navigate the vehicles in a way which minimizes the cost (time) of travel. The time of travel is the time passed between the earliest start time of one of the vehicles and the last arrival at a destination. When two vehicles or more are involved in the problem, the requirement of collision avoidance may introduce a conflict between the optimal paths of the various vehicles. In order to resolve these conflicts the database of possible paths is vastly extended and the search for an optimal solution is very complicated. In a different study ® we have directly solved the one- and two-vehicle navigator in a multi-speed discrete space. However, we did not find a way to extend it as a practical technique for the general multi-vehicle navigator.

A simple NS terrain is defined with a binary speed. On this NS the speed of a vehicle, at each point, is either a positive constant or 0 (for an obstacle). The present study is an attempt to construct a multi-vehicle navigator, in binary speed space, using neural networks. By using neural networks one usually trades an optimal solution accomplished in "infinite" time for a "good" solution accomplished in reasonable time.

This study is organized as follows. In section 2 we introduce the cost-surface space and the paths as graphs in this space. The cost-linking map is presented and we discuss the difference between paths solving a one-vehicle navigator and those solving a multi-vehicle navigator. In section 3 the neural formulation is presented with the mapping of the space into neural variables, "neural paths" and equations. Section 4 contains the results of our simulations and section 5 the discussion.

2.  Paths  in  the  cost-surface  space 

A discrete representation of the surface of navigation is obtained by mapping the surface onto a graph as follows: a set of points v(x, y) is chosen on the navigation space to be the nodes of the graph. Each node is connected by an edge to every other node which can be reached directly from it. To every edge is assigned a weight which reflects the cost of traveling it. An edge can have two different weights for traveling it in opposite directions. A path is a sequence of adjacent directed edges from the origin to the destination.

Let us extend the NS terrain to a time-surface space, as shown in Figure 1. A path in this space is a sequence of edges, monotonic in t, between the source and destination. However, a legal path is one which obeys the restrictions set by the terrain. In order to get only legal paths, we construct a time-linking map®. This map assigns to every node the minimal time of travel needed to reach it from the origin. Using this map one can construct a graph of all the optimal paths from the source to all the nodes. This map specifies the t-coordinate of each node of the graph in the time-NS space. An optimal path for one vehicle, in the time-surface space, is single-valued in v(x,y) and t. Namely, there is a one-to-one correspondence between v(x,y) on the path and t. When more than one vehicle is involved, each one of them has its own linking map. However, the optimal paths of two different vehicles may conflict. To avoid such a conflict one of the vehicles may be requested to postpone its arrival at, or to detour around, the point of conflict. This imposition introduces paths which are not single-valued in v(x,y) and t, as illustrated in Figure 2.

3. Neural formulation

Neural networks have been studied as an approach to various hard (NP-complete) optimization problems. Various applications have been investigated and explored^ ® since the work of Hopfield and Tank®. Here, we explore the possibility of using these massively parallel networks for the multi-vehicle navigation problem.

The paths of the vehicles, as discussed above, are viewed as trajectories in the space-time. The space-time is mapped into neural variables in the following way: it is divided into a regular three-dimensional lattice (x,y,t). (For notational simplicity, we denote (x,y) by the vector $x$ subsequently.) To each unit cell we associate a neural variable $\eta_i(x,t)$ whose desired value, at stable state, is:

$$\eta_i(x,t) = \begin{cases} 1 & \text{if vehicle } i \text{ is at position } x \text{ at time } t; \\ 0 & \text{otherwise.} \end{cases}$$

A path is a sequence of neural variables with $\eta_i(x,t) = 1$ where $t$ ranges from 0 to $T$ and $T$ is the time in which this path is traveled. A neural network is set up such that the neurons converge to a stable state which determines the paths, as illustrated in Figure 3. A common practice in optimization by neural networks is to choose an energy function. However, finding the shortest paths is an iterative process, which makes our energy function time-dependent. Therefore, instead of minimizing an energy function we directly write down the equations relating the input "voltage" of the neurons to their output voltage. These equations impose the desired behavior of the neurons. Specifically, we have

$$du_i(x,t)/d\tau = C_1 + C_2 + C_3 + C_4 + C_5$$

where the first term evaluates the propagation of the path from the present position in the forward direction. The second term evaluates it with respect to the backward direction. The third term avoids head-on collision and the fourth term avoids swapping, which occurs when two vehicles adjacent to each other switch positions. The fifth term forces one of the neighbors of an "on" neuron to be on, i.e. enforces continuity of the paths. In terms of neural variables, the dynamical equation is as follows:


dui(x,t)/dT  =  gate{i)(^ 

-  «i(x,f) 

y€Nb{x) 

+  42  ^  ij,(y,t  +  l)Wy,,pasti(x,t) 

yeNb(x) 

+  A3'^t)k{x,t)9{si{t)  -  Sk{t))+ 

lb 
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+  X!!  -  ^)Vk{y,t)- 

k  y 

JJ  (l-»j<(x',<-l)(W^»-W'„.)/2)- 

r'6N6(r),x'96y 

n  (l-»?*(x',t-l)(Vy»-Wy,.)/2) 

x'£Nh(y),x'^x 

+  M  51  f{vi{y,t-\)y 

y^Nb{x) 

JJ  (l-/l(%(x',<))) 

x'eNb(y) 

where every term corresponds to the respective term $C_i$. Here $\eta_i(x,t) = h(u_i(x,t))$, where $h(\cdot)$ is a sigmoid function giving the relation between the input and the output voltage of a neuron, and $Nb(x)$ is the neighborhood of $x$. $W_{xy}$ is the cost of travel from $x$ to $y$ with regard to the destination of the vehicle. Namely, $W_{xy} = T(x) - T(y)$, where $T(u)$ is the time-linking map value of the node $u$. If $T(y) < T(x)$, it encourages the forward (in time) propagation of the path from $x$ to $y$. The factor $past_i(x,t) = \sum_{y \in Nb(x)} \eta_i(y,t-1)$ gates the backward propagation: a neuron is affected by the future information only if it is a continuation of a path. $s_i(t) = \sum_x \eta_i(x,t)$, and $g(\cdot)$ is another sigmoid function which says that in case of collision, the vehicle with more possible paths should give way. The swapping term is the most complicated. We leave out the detailed explanation except to say that $f(\cdot)$ and $f_1(\cdot)$ are appropriately chosen highly nonlinear functions. Lastly, $gate(i)$, which depends on $\eta_i(x_{d_i},\tau)$ for $\tau < t$, stops the propagation for vehicle $i$ once its destination $x_{d_i}$ has been reached.

In  the  equations  above,  we  encourage  all 
possible  paths  to  be  stored  in  the  states  of  the 
neurons.  The  redundancy  in  the  formulation 
makes  this  possible.  When  the  destinations  are 
reached,  we  backtrack  and  choose  one  of  the  best 
paths  computed  by  the  network. 

In the absence of collisions, the paths obtained are clearly the original optimal single-vehicle paths, computed without regard to collisions.

Since the problem is inherently time dependent, the neuronal states at large t naturally wait for the information from neurons at smaller t. We may as well solve the equation for a fixed time window w; namely, we compute the paths for the next w moves. Then we repeat the procedure, calculating the paths piecewise until the destinations are reached.

4.  Simulation  results 

We numerically integrated the above dynamical system using a simple Euler method, in which case synchronization does not have to be exactly enforced. Recall that an Euler solver for a differential equation $dx/dt = f(x)$ is an iterative mapping: $x_{i+1} = x_i + \epsilon f(x_i)$. This, together with the locality of the computational stencil, enables us to parallelize the above algorithm very efficiently. If we go back to the equations above, the only global computation is computing $s_i(t)$, which can be obtained by locally updating the sum within each processor and combining the results in a binary tree. By iteratively solving a differential equation, exact synchronization is not needed because the dynamics is continuous. In a similar study*^, with slightly modified dynamic equations, almost perfect speedup was obtained when it was implemented on the Meiko Computing Surface, a parallel machine with up to 32 transputer nodes, as illustrated in Figure 4. The differential form also introduces some cooperation into the algorithm. This can be observed in the conflicting regions, such as head-on collision and swapping, in which case the neurons iteratively adjust their values, trying to resolve the conflict.
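The Euler mapping recalled above can be written as a short sketch. The function name `euler`, the step size, and the test equation $dx/dt = -x$ are illustrative choices for the example and are not the network equations of the paper.

```python
def euler(f, x0, eps, steps):
    """Iterate x_{i+1} = x_i + eps * f(x_i) for a fixed number of steps."""
    x = x0
    for _ in range(steps):
        x = x + eps * f(x)
    return x


# Illustrative test problem: dx/dt = -x, whose exact solution decays to 0.
x_final = euler(lambda x: -x, 1.0, 0.1, 50)
```

Each step uses only the current value and the local right-hand side, which is why the update of one neuron needs only its own stencil of neighbours.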

5.  Discussion 

The neural net yields paths which slightly deviate from the optimal one-vehicle paths. This is because it is dominated by two main forces: one is the collision avoidance force and the other is the single vehicle optimal path attractors. These attractors are the graphs determined by the linking map from a node to the destination. In our study of the two vehicle analytic navigator® we are using an algorithm which updates this map. Applying this idea to the neural net system can improve the solutions obtained above. The neural net four vehicle navigator performs well, see Figure 3. However, with more vehicles and various possible paths for each it may perform less satisfactorily. Clearly we have only presented a very initial study here. We need to look at much more complex problems, including three-dimensional navigation. We are also looking into the elastic net ideas of Durbin and Willshaw ’’ as interpreted by Simic* into neural networks. There are important analogies between track finding, computer vision and navigation which we are exploring in an integrated research program'®.
Figure 1. A path in the cost-terrain space.


Figure 2. The two-vehicle navigator solution for a conflict imposing terrain and a path in the Cost-Terrain space.

Figure 3. Four paths in the cost-terrain space calculated by the neural net.

Figure 4. Speedup for 4 vehicle navigator running on 16-node Meiko Computing Surface.
Abstract

Filtering data to remove noise is an important operation in image processing. While linear filters are common, they have serious drawbacks since they cannot discriminate between large and small discontinuities. This is especially serious since large discontinuities are frequently important edges in the scene. However, if the smoothing action is reduced to preserve the large discontinuities, very little noise will be removed from the data.

This paper discusses the parallel implementation of a connectionist network that attempts to smooth data without blurring edges. The network operates by iteratively minimizing a non-linear error measure which explicitly models image edges. We discuss the origin of the network and its simulation on an iPSC/2. We also discuss its performance versus the number of nodes and the SNR of the data, and compare its performance with a linear Gaussian filter and a median filter.

Introduction 

A common operation in image processing is filtering to remove noise. One of the simplest methods is to implement a linear low-pass filter by convolution with a Gaussian, or other, kernel. The availability of dedicated convolution processors makes this option especially attractive for many machine vision systems. Unfortunately, the linear filter has serious drawbacks. It cannot discriminate between large discontinuities and small discontinuities. Nor can it model the structure of the data to discriminate between correlated discontinuities, such as edges, and random noise. Large correlated discontinuities are frequently edges of objects in the scene, which convey considerable information. Reducing the amount of smoothing in order to preserve important discontinuities also reduces the amount of noise removed. Thus, linear filtering is a compromise between preserving large discontinuities and removing noise from the data. A good compromise can be very difficult to strike.

As general-purpose parallel processors become more widely available, and as their cost continues to decline, the performance advantage of convolution hardware will be reduced. This will allow more sophisticated filters to be used without an unacceptable performance penalty. This paper discusses the parallel implementation of a 'neural network' approach to the data smoothing problem. The smoothing technique is based on iterative minimization of a non-linear error measure. The error measure has several components. Squared error of the solution from the input data and smoothness of the solution are two of the components. These are very common [1,2]. The unusual portion of the error measure is the introduction of 'breakpoints' across which the smoothing terms have no weight. This modification of the surface reconstruction problem appears to have first been used in [3]. These terms model edges in the image and allow us to smooth noisy data without blurring the edges of objects in the image. A one-dimensional version of the resulting energy measure was presented in [4] as:

$$E(f,h) = \sum_i (f_{i+1} - f_i)^2 (1 - h_i) + C_D \sum_i (f_i - d_i)^2 + C_L \sum_i h_i \qquad (1)$$

where $f_i$ is the smoothed output value from site $i$, $h_i$ is a binary variable indicating the presence or absence of a discontinuity between units $i$ and $i+1$, $d_i$ is the input data to unit $i$, $C_D$ is the cost of getting away from the data relative to the unit weight of the interpolation term, and $C_L$ is the cost of inserting a discontinuity. To allow for sparse data, the summation in the data term is only taken over those points where $d_i \neq 0$. $C_D$ depends upon the signal to noise ratio of the input data.

The interpretation of this equation is that if $(f_i - f_{i+1})^2 > C_L$, then it is cheaper to pay the price of inserting the discontinuity than to continue smoothing over the large disparity in function values. The data filtering is performed by adjusting the $f_i$ and $h_i$ to minimize $E$. Since the discontinuity terms introduce local minima into the cost surface, standard minimization algorithms will not work very well. Simulated annealing was used in [3] to perform the minimization.
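The trade-off described above can be checked numerically with a small sketch that evaluates the one-dimensional energy (1). The function name `energy`, the list-based data layout, and the sample values are assumptions made for illustration; following the text, a data value of 0 is treated as missing.

```python
def energy(f, h, d, c_d, c_l):
    """One-dimensional energy (1): smoothness + data fidelity + break cost.

    f: smoothed values, h: break indicators between i and i+1 (len(f) - 1),
    d: input data (0 marks a missing sample), c_d / c_l: the two weights.
    """
    smooth = sum((f[i + 1] - f[i]) ** 2 * (1 - h[i]) for i in range(len(f) - 1))
    data = c_d * sum((fi - di) ** 2 for fi, di in zip(f, d) if di != 0)
    breaks = c_l * sum(h)
    return smooth + data + breaks


# A step edge of height 4: the squared jump (16) exceeds C_L = 4.
d = [1, 1, 5, 5]
f = list(d)
e_no_break = energy(f, [0, 0, 0], d, 1.0, 4.0)
e_break = energy(f, [0, 1, 0], d, 1.0, 4.0)
```

With these values, placing a break at the jump is cheaper than smoothing across it, in line with the condition $(f_i - f_{i+1})^2 > C_L$.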

Koch  Network 

Koch et al. [4] present another method for minimizing (1) based on the work of Hopfield [5,6]. Hopfield suggested solving optimization problems by changing the binary variable, $h_i$, to a continuous [0,1] variable that is a nonlinear function of an underlying state variable. Additional terms are introduced into the energy function to force the solution toward 0 or 1. The minimization is then carried out by having a network of units, one for each of the $f_i$ and the $h_i$, which update their value by the rules:


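$$\frac{df_i}{dt} = -\frac{\partial E}{\partial f_i} \quad \text{and} \quad \frac{dm_i}{dt} = -\frac{\partial E}{\partial h_i} \qquad (2)$$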


where $m_i$ is the state variable underlying $h_i$. We are not showing the time constants that set the rate of change of the units. The nonlinear function, $g(\cdot)$, is typically the sigmoid nonlinearity:


hi  =  g(mi)  = 


1 

1  + 


(3) 


Because  of  the  update  rule,  each  site  takes  a  small 
step  down  the  gradient  of  the  cost  function.  While  each 
site  must  take  many  steps  to  reach  the  minimum  of  the 
function,  the  steps  can  proceed  in  parallel.  Therefore,  for 
a  large  number  of  sites,  the  total  time  to  perform  the 
minimization  should  be  reduced. 

The function used in this study was proposed in
[4]. It is for filtering two-dimensional data, as opposed
to the one-dimensional data smoothing that would be
performed by (1). The presence of a 'horizontal' break,
one between $f_{i,j}$ and $f_{i,j+1}$, is indicated by $h_{i,j}$. Vertical
breaks, between $f_{i,j}$ and $f_{i+1,j}$, are indicated by $v_{i,j}$. The
function used is:


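$$E = E_I + E_D + E_L + E_G \tag{4a}$$

$$E_I = \sum_{i,j} (f_{i,j+1} - f_{i,j})^2 \, (1 - h_{i,j}) \tag{4b}$$

$$E_D = C_D \sum_{i,j} (f_{i,j} - d_{i,j})^2 \tag{4c}$$

$$E_L = C_V \sum_{i,j} h_{i,j} (1 - h_{i,j}) + C_P \sum_{i,j} h_{i,j} h_{i,j+1} + C_C \sum_{i,j} h_{i,j} + C_L \sum_{i,j} h_{i,j} \left[ (1 - h_{i+1,j} - v_{i,j} - v_{i,j+1})^2 + (1 - h_{i-1,j} - v_{i-1,j} - v_{i-1,j+1})^2 \right] \tag{4d}$$

$$E_G = C_G \sum_{i,j} \int_0^{h_{i,j}} g^{-1}(h) \, dh$$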


where $f_{i,j}$ is the interpolated surface and $h_{i,j}$ and $v_{i,j}$ are
the horizontal and vertical line processes. Note that the
energy expression above is only for $h_{i,j}$. The other half
of the expression can be obtained by replacing $h_{i,j}$ with
$v_{j,i}$ and interchanging i and j. The first term in
$E_L$ forces $h_{i,j}$ to either 0 or 1, the second term penalizes
the formation of parallel lines, the third term is the
constant price that is paid for introducing
discontinuities, and the fourth term is an interaction
term which favors continuous lines while penalizing
multiple line intersections, line crossings, or
discontinuous line segments. The line outputs $h_{i,j}$ and
$v_{i,j}$ are functions of internal state variables. Since the
discontinuity terms are asymptotic to 0 and 1, the update
rules would drive these state variables to $\pm\infty$ in a futile
attempt to drive the visible outputs to 0 or 1. The $E_G$
term prevents this by penalizing excessive values for
the state variables. The smoothed data output and the
internal state variables are updated according to:

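$$\frac{df_i}{dt} = -\frac{\partial E}{\partial f_i}, \quad \frac{dm_i}{dt} = -\frac{\partial E}{\partial h_i}, \quad \frac{dn_i}{dt} = -\frac{\partial E}{\partial v_i} \tag{5}$$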


Parallel  Implementation 

The network was simulated on an iPSC/2 SX
(Weitek FPUs) under HIP, the Hypercube Image
Processor [7]. HIP is a system for interactive image
processing, as well as a framework for developing
parallel image processing algorithms. By providing
predefined image decompositions, I/O procedures, and a
body of image processing functions, HIP reduces the
effort required to develop parallel image processing
algorithms. HIP supports floating-point image buffers
in addition to character and integer types. It also
supports multi-spectral image buffers. HIP's image
buffers have a simple decomposition. Each image is
divided into as many horizontal strips as there are nodes
and each strip is given to a different node. Each strip is
provided with a border of data to hold initial conditions
for convolutions and similar neighborhood operations.
The top and bottom of the border are updated from the
active regions of the two neighboring nodes. The sides of
the border are updated by replicating the first and last
column.
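The strip decomposition described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not HIP's actual code: the function names `split_into_strips` and `update_borders` are hypothetical, the strips live in one Python list rather than on separate nodes (so the "communication" is a plain array copy), the image height is assumed to divide evenly among the nodes, and the top of the first strip and the bottom of the last strip are filled by replicating the nearest active row, a choice the text does not specify.

```python
import numpy as np

def split_into_strips(image, n_nodes, border=1):
    """Cut `image` into `n_nodes` horizontal strips, each padded by `border` cells."""
    rows, cols = image.shape
    assert border >= 1 and rows % n_nodes == 0, "sketch assumes an even split"
    h = rows // n_nodes
    strips = []
    for node in range(n_nodes):
        strip = np.zeros((h + 2 * border, cols + 2 * border), dtype=image.dtype)
        # The active region sits inside the border.
        strip[border:-border, border:-border] = image[node * h:(node + 1) * h]
        strips.append(strip)
    return strips

def update_borders(strips, border=1):
    """Refresh each strip's border from the neighbouring strips' active regions."""
    n = len(strips)
    for k, s in enumerate(strips):
        if k > 0:
            # Top border <- last active rows of the strip above.
            s[:border, border:-border] = strips[k - 1][-2 * border:-border, border:-border]
        else:
            # Assumption: no neighbour above, so replicate the first active row.
            s[:border, border:-border] = s[border:border + 1, border:-border]
        if k < n - 1:
            # Bottom border <- first active rows of the strip below.
            s[-border:, border:-border] = strips[k + 1][border:2 * border, border:-border]
        else:
            # Assumption: no neighbour below, so replicate the last active row.
            s[-border:, border:-border] = s[-border - 1:-border, border:-border]
        # Side borders replicate the first and last active columns.
        s[:, :border] = s[:, border:border + 1]
        s[:, -border:] = s[:, -border - 1:-border]
```

On a real distributed-memory machine the two neighbour copies would be message exchanges, and the copy is only this simple when the border has no more rows than the neighbour's active region.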

The network was simulated by using a floating-point
image buffer with 5 spectral bands. The first band
holds the state of the smoothed output values, the
second and third hold the horizontal and vertical
discontinuities, while the fourth and fifth hold the state
variables underlying the horizontal and vertical
discontinuities, $m_i$ and $n_i$. The input data, $d_i$, comes
from a separate image buffer. At each iteration, all the
sites in the smoothed output layer are updated according
to the update rules in (5). At the end of each iteration,
the border data is updated with the new values from the
neighboring nodes. The discontinuities are not updated
every iteration. The frequency of their update is
controlled by a command-line option. For the results
presented in this paper, they were updated every 4
iterations.

The  depth  layer  is  initialized  with  the  values  of  the 
image  to  be  smoothed,  while  the  edges  and  their 
underlying state variables are initialized to the middle of
their  range.  The  minimization  procedure  was  terminated 
when  the  value  of  the  cost  function  stopped  decreasing. 
The  number  of  iterations  this  took  depends  upon  the 
data  and  the  settings  of  the  time  constants  not  shown  in 
equation  (2),  but  was  generally  between  10  and  20 
iterations. 


[Plot; horizontal axis: Number of Nodes (2, 4, 8, 16, 32)]

Figure 1: Time / Iteration vs. Number of Nodes
and Image Size


Performance 

There are several facets of the network's
performance to characterize. We are, of course, interested
in how well it parallelizes. We must also be interested
in the quality of filtering it performs and its ease of use.
We will discuss these in order.

Since  the  filter  is  iterative,  it  is  impossible  to 
predict  in  advance  how  many  iterations  will  be  required 
for  termination.  For  this  reason  we  will  report  the 
execution  time  in  two  fashions.  When  we  look  at  how 
well  the  network  parallelizes,  we  report  the  time  taken 
to  complete  a  single  iteration  that  updates  the  depth  and 
discontinuity  layers.  When  we  compare  the  filter  with 
other  filters,  we  will  report  total  times. 

Since  each  site  is  updated  only  as  a  function  of  the 
sites  in  a  four  nearest  neighborhood,  we  would  expect  to 
see  nearly  linear  speedup  as  we  increase  the  number  of 
nodes.  We  would  also  expect  the  execution  time  to  be 
directly  proportional  to  the  size  of  the  data  set.  This  is 
just  what  is  shown  in  figures  1  and  2,  which  report  the 
time  to  complete  one  iteration  as  a  function  of  the 
number  of  nodes  and  the  size  of  the  image.  Figure  2 
uses  a  log  scale,  and  shows  almost  perfectly  linear 
behavior. The speedup coefficients are 0.86, 0.91, 0.94,
and 0.97 for the 64x64 through 512x512 image sizes, respectively.

Note  that  the  data  points  missing  from  figures  1 
and  2  are  due  to  insufficient  node  memory  to  hold 
large  images  on  few  nodes.  The  log  scale  shows  an 
anomaly  for  the  64x64  image  on  32  nodes,  which 
requires  some  explanation.  Recall  the  border  of  initial 
conditions  data  that  HIP  provides  in  its  image 
decomposition.  If  the  border  has  more  lines  in  it  than 
the  neighboring  node  has  in  its  active  region,  several 
communication  steps  will  be  needed  to  update  the 
borders.  HIP  determines  if  this  is  the  case  and  uses  a 
fast  border  update  if  possible,  and  a  slow-but-sure  update 
if  not.  A  64x64  image  on  32  nodes  has  only  two  rows 
in  its  active  region,  so  we  are  seeing  a  different  border 
update  procedure.  The  alert  reader  may  be  asking  why 
more  than  a  single  row  in  the  border  is  needed. 
Actually, it is not. This is a just-discovered coding error
in  HIP  which  will  be  corrected  before  it  is  released  to 
the  iSC  User  Group  library. 


[Plot; horizontal axis: Number of Nodes]

Figure 2: Time / Iteration vs. Number of Nodes
and Image Size (Log Scale)

The timings above were all for a single iteration of
the network. We also need to know the total time for
the network to converge. These are given below in table
1 for the 16 node case.

The Koch network is not the only non-linear data
filter available. We decided to compare its performance
with two other filters, a 5x5 linear Gaussian low-pass
filter and a 5x5 median filter. An artificial image was
generated and corrupted with different amounts of noise.
The three filters were applied and their execution times
noted. Finally, the sum of the squared errors was
computed. The image used is shown below in figure 4a.
It is 128x128 with the darkest gray level at 25 and the
brightest at 229, with the other 3 levels at 76, 127, and
178. The noise added was uniformly distributed and
zero-mean. Two magnitudes were used, from -25..25 and
-12.5..12.5. These correspond to SNRs of 13 dB and 20
dB, respectively. The outputs of the filters for the 13 dB
SNR are shown in figures 4b..4e. Figure 4f shows the
horizontal discontinuities detected by the Koch network;
the vertical discontinuities are similar. The filters were
also applied to the uncorrupted image to see what
damage they would inflict upon perfect data. The sums
of the squared errors (SSE) are plotted below in figure 3.
A perfect filter would be a flat line at the bottom of the
graph. The steepest line is the SSE for the unfiltered,
noisy image. Data points above this line show a filter
that is doing more harm than good.
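The error measure used in this comparison is simple enough to state as code. The sketch below is a hedged illustration, not the code used for the reported results: the helper names `add_uniform_noise` and `sum_squared_error` are hypothetical, and the noise is drawn with NumPy's generator purely for convenience.

```python
import numpy as np

def add_uniform_noise(image, half_width, rng):
    """Corrupt `image` with zero-mean noise uniform on [-half_width, half_width]."""
    noise = rng.uniform(-half_width, half_width, size=np.shape(image))
    return np.asarray(image, dtype=np.float64) + noise

def sum_squared_error(filtered, reference):
    """Sum over all pixels of the squared difference from the clean image."""
    diff = np.asarray(filtered, dtype=np.float64) - np.asarray(reference, dtype=np.float64)
    return float(np.sum(diff * diff))
```

A filter is helping when `sum_squared_error(filter(noisy), clean)` is below `sum_squared_error(noisy, clean)`, which is the comparison against the steepest line described above.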




method for determining them. Furthermore, the parameters
are sensitive to the magnitude of the data: 0..1 data
requires different parameters than 0..255 data.

Figure  4f  shows  another  problem  of  the  network. 
Its  small  neighborhood  size  makes  it  sensitive  to  tiny 
regions  of  correlated  noise.  This  can  be  overcome  by  the 
use  of  multiresolution  techniques  [8],  which  seem  to 
give excellent results. We hope to add image pyramid
buffers to HIP in the future in order to attempt to
duplicate the results in [8].


Figure 3: Sum of Squared Error vs. Filter Type and Noise Level

This figure shows that the Koch network performs
much better than the linear Gaussian filter, but not as
well as the median filter. It also takes much longer to
execute, as shown below in table 1. These are the times
on 16 nodes for the 13 dB SNR images.

Table 1: Execution Times of Filters (sec.)

Image Size    Gaussian    Median    Koch

64            0.577       0.483     10.47
128           1.1         0.883     19.48
256           2.26        4.14      41.91
512           3.54        14.18     63.86

Conclusions 

Neural network approaches to machine vision tasks
are the focus of a great deal of research interest. Part of
the reason for this is their massive parallelism.
Their fine-grained structure allows them to be
mapped onto almost any parallel architecture, although
networks that are almost completely interconnected will
pay a performance penalty. Those networks with restricted
interconnections between units, such as the Koch
network, are especially easy to implement on
distributed-memory computers. This is shown by the excellent
speedup as the number of nodes increased and the O(N)
behavior as the data size increased.

While the Koch network is easily parallelized, so
are many standard filters used in image processing. The
median filter performs better than the Koch network by
all the measures we used. It also has the advantage of a
predictable execution time. Since the Koch network is
iterative, one can never be entirely sure how long it will
take to complete. The Koch network does have an
advantage over the median filter for the case where a
dedicated VLSI implementation is considered. The
update rules in (5) can be implemented by analog
computation, which would offer a tremendous performance
improvement over the digital multiply and add.

Another  disadvantage  to  the  Koch  network  is  that  it 
is  hard  to  use.  There  are  many  parameters  that  must  be 
balanced  to  achieve  good  performance  and  no  a  priori 
ABSTRACT

One of the intermediary stages of image analysis
in vision is the process of component labeling. Given
a digital black/white image distributed throughout
the nodes of an Intel iPSC/2 hypercube, the objective
of this research is to develop and implement efficient
parallel algorithms for labeling the (black) connected
components. The basic solution strategy is
based on divide-and-conquer, in which each node initially
labels the subimage that it is responsible for.
The results of the local labeling are then combined using
a boundary-overlap resolution strategy. In an
effort to develop algorithms that are both time and
space efficient, we consider manipulating various data
structures, using a variety of sequential and parallel
component labeling schemes, and performing load
balancing techniques to maximize parallelism. The
images currently under experiment range from real
pictures extracted from scanners, to medical X-rays,
to digital images generated with respect to certain
constraints. Experimental results, analysis, and
interpretation of various algorithms are presented.


This  work  was  partially  supported  by  NSF 
grants  IRI-8800514  and  ASC-8705104. 
1. Introduction

A digitized black/white image, also known as
a binary image, consists of a two-dimensional array
of pixels with foreground pixels having the value 1
(black), and background pixels having the value 0
(white). A (connected) component in an image is defined
to be a set of maximally connected foreground
pixels, where two pixels are connected if and only
if they are adjacent. There are two common definitions
of adjacency. In the first definition, known as
4-connectivity, two black pixels are defined to be adjacent
if and only if one pixel is directly above, below,
to the left or to the right of the other pixel. In the
second definition, known as 8-connectivity, two black
pixels are defined to be adjacent if and only if one is
one of the eight closest pixels of the other. The problem
of component labeling is to assign a unique label
to every component in an image so that every pixel is
assigned its component's label.

Our  goal  is  to  design  algorithms  for  labeling  the 
components  of  a  digitized  image  on  a  hypercube.  The 
algorithms  presented  assume  that  the  binary  image  to 
be  processed  has  been  partitioned  into  vertical  slices 
and  distributed  throughout  the  nodes  so  that  each 
node  is  responsible  for  a  unique  strip  as  illustrated  in 
figure  1. 

Nodes: a b c d


Figure  1.  Dividing  the  image  and  distributing  it 
amongst  the  nodes 

We perform component labeling based on
divide-and-conquer in two major steps: the first step consists
of a sequential component labeling algorithm that is
applied to each vertical slice (base case), and the second
step consists of a strategy for resolving the conflicts
between the boundary labels of the neighboring
subimages (conquer).

In section 2, we give a brief description of an
approach to the sequential step. Section 3 covers
two different algorithms for solving the problem of
boundary-overlap resolution. Section 4 discusses an
algorithm for the path resetting problem in a resolution
table based on a series of union/find operations.
Load balancing techniques are presented in section 5,
and finally the timing results are given at the end of
the paper.

2.  Sequential  Algorithm 

The sequential labeling algorithm that we use is
based on a two-pass labeling scheme similar to [1].
During the first pass the image is examined row by
row (top-down, left to right) while labeling the foreground
pixels as follows.

i) Assign a new label to a foreground pixel
that is not connected to any other previously
labeled foreground pixel.

ii) Assign the same label to a foreground pixel
that is connected to a previously labeled
foreground pixel.

iii) If there are two adjacent pixels with different
labels, create an entry in the resolution
table indicating that the two labels must be
resolved during the second pass.

Once the first pass is complete, some of the components
are not labeled in a consistent fashion (as
discussed in iii above). The reset_paths procedure,
discussed in section 4, takes the resolution table and
resolves the conflicts among the labels, producing a final
table which contains a list of the labels resolved
and their new values.

During the second pass through the image, as
each foreground pixel is examined, a search is performed
in the resolution table to see if the label of
the pixel should be updated.
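The two passes can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the function name `label_components` is hypothetical, the image is a list of rows of 0/1 values, and the resolution table is kept directly as a union/find parent array so that conflicts are resolved as they are recorded instead of by a separate path resetting call.

```python
def label_components(image, connectivity=8):
    """Two-pass labeling of a binary image given as a list of rows of 0/1 values."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]  # parent[l] is the resolution entry of label l; 0 is background

    def find(l):
        while parent[l] != l:
            parent[l] = parent[parent[l]]  # shorten the path as we walk it
            l = parent[l]
        return l

    # Neighbours already visited by a top-down, left-to-right scan.
    offsets = [(-1, 0), (0, -1)]
    if connectivity == 8:
        offsets += [(-1, -1), (-1, 1)]

    # First pass: assign provisional labels and record conflicts.
    for r in range(rows):
        for c in range(cols):
            if not image[r][c]:
                continue
            seen = [labels[r + dr][c + dc] for dr, dc in offsets
                    if r + dr >= 0 and 0 <= c + dc < cols
                    and labels[r + dr][c + dc]]
            if not seen:
                parent.append(len(parent))        # rule i: new label
                labels[r][c] = len(parent) - 1
            else:
                labels[r][c] = min(seen)          # rule ii: reuse a label
                for other in seen:                # rule iii: note conflicts
                    ra, rb = find(labels[r][c]), find(other)
                    if ra != rb:
                        parent[max(ra, rb)] = min(ra, rb)

    # Second pass: replace every provisional label by its resolved value.
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                labels[r][c] = find(labels[r][c])
    return labels
```

The final labels are unique per component but not necessarily consecutive, which matches the problem statement in section 1.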

3.  Conflict  Resolution 

Since the sequential algorithm is applied to each
subimage independently, it is expected that inconsistencies
in the assigned labels exist for objects that lie
across the boundaries of subimages.

In  this  section  we  describe  algorithms  to  resolve 
the  inconsistencies  at  the  boundaries  of  subimages. 
We  first  turn  our  attention  to  a  simple  approach  to 
solve  the  problem,  and  later  present  load  balancing 
techniques  to  speed  up  the  process. 

Conceptually  we  arrange  the  nodes  of  the  hyper- 
2) In doing step ii, it is necessary to have a list of
pixel  positions  corresponding  to  a  label  in  a  given 
array  of  boundary  pixels. 

The table is therefore initially set up to contain
an entry corresponding to each label and sorted by
increasing value of the labels so that adding entries
would consist of performing a search on the label and
assigning a new value to it. For every array of boundary
pixels there is a list of labels created, with each label
having a list of pixel positions assigned that label.

The above algorithm can then be implemented as
procedure resolve_overlaps. Given a foreground pixel
at position x labeled l, two arrays of boundary pixels
col1 and col2, two lists of labels list1 and list2, and
a resolution table T, the procedure finds the pixels
adjacent to it across the boundary and resets their
label in the table to l (we refer to the adjacent pixels
across the arrays as "neighbors").


resolve_overlaps

procedure resolve_overlaps(col1, col2, list1, list2, T, l, x)
    for each i in neighbors_of(x, col2) do
        k := label at i
        tempk := k
        if (k, newk) ∈ T
            k := newk
        if (k ≠ l)
            add (k, l) to T
            setofpixels := search(list2, tempk)
            for each j ∈ setofpixels do
                resolve_overlaps(col2, col1, list2, list1, T, l, j)
            end for
        end if
    end for


A complete resolution table can then be formed
by successive calls to resolve_overlaps for all the
foreground pixels in one of the arrays as follows:


for each black pixel x in col1 do
    l := label of x
    if (l, newl) ∈ T
        l := newl
    resolve_overlaps(col1, col2, list1, list2, T, l, x)
end for

3.1.2  Alternative  Approach 

The alternative approach proceeds by pairing labels
of adjacent foreground pixels in the two arrays
of boundary pixels and forming a table (figure 3).
Next, the table is sent to the path resetting algorithm
discussed in section 4, whereby a resolution table is
obtained. The advantage of this over the initial approach
is that the find and union operations used in
the path resetting algorithm use path compression so
that successive search operations in the forest can be
performed more efficiently.

A variation of this algorithm that we have implemented
avoids making duplicate entries in the
table as much as possible so that the input to the
path resetting algorithm is smaller. This is
done by comparing the current label pair with the
most recent entry inserted in the table, and inserting the
current entry only if it is different.

In figure 3 an example of an input to the path
resetting algorithm set up by the above scheme is shown.
The two boundary pixel arrays of figure 2b are considered,
and the duplicate entries are omitted.
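The pairing step, including the duplicate check against the most recent entry, might look like the sketch below. It is an illustration under assumptions rather than the authors' code: the name `pair_boundary_labels` is hypothetical, the two boundary arrays are equal-length lists with 0 for background, adjacency across the boundary is 8-connectivity, and labels are assumed to be already unique across the two slices.

```python
def pair_boundary_labels(col1, col2):
    """Build the input table for path resetting from two boundary pixel arrays."""
    table = []
    n = len(col1)
    for i, a in enumerate(col1):
        if a == 0:
            continue  # background pixel, nothing to pair
        # Neighbours across the boundary under 8-connectivity.
        for j in (i - 1, i, i + 1):
            if 0 <= j < n and col2[j] != 0:
                pair = (a, col2[j])
                # Skip the entry if it repeats the most recent one.
                if not table or table[-1] != pair:
                    table.append(pair)
    return table
```

Comparing only against the last entry does not remove every duplicate, but it costs constant time per pair and catches the common case of a long run of identical neighbouring labels.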
Figure 3. Table set up for the path resetting
algorithm

The  conflict  resolution  table  given  in  figure  2c  is 
the  result  of  the  path  resetting  algorithm  applied  on 
figure  3. 


4.  Path  Resetting 

The sequential algorithm discussed in section 2
and the boundary resolution algorithm discussed in
section 3.1.2 produce tables of the form illustrated in
figure 3. In these tables, for every label pair (l1, l2)
there can be a corresponding label pair (l1, l3) or
(l2, l3). In either case the labels l1, l2 and l3 are to
be resolved to a unique label by modifying the table
to contain the label pairs (l1, l1), (l2, l1) and (l3, l1)
instead. Path resetting is the process of grouping the
labels into disjoint sets where an element belongs to
a set if and only if there is an entry corresponding to
that label and another member of the same set in the


end  for 


cube in a linear array, with each node having a labeled
slice of the image. The algorithm proceeds by
entering a loop. In each iteration, the nodes in the
linear array are paired, the boundary-overlap resolution
algorithm is applied to each pair, the boundaries
are consequently updated, and then pairs of slices
are combined into bigger slices, thereby halving the
number of slices to be processed. Figure 2a illustrates
the array of processors for a 4-node hypercube.
In the first step all of the 4 nodes participate in
the process of conflict resolution, and then groups of
two are formed in the second step, where conflict resolution
only occurs in the nodes lying at the boundaries
of the bigger slices (i.e. b and c).
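The pairing pattern of this loop can be written down explicitly. The sketch below is an assumption-laden illustration: the name `merge_schedule` is hypothetical and nodes are numbered 0, 1, 2, ... along the linear array (so a, b, c, d correspond to 0, 1, 2, 3). For each iteration it lists the pairs of nodes whose slices meet at a boundary that must be resolved.

```python
def merge_schedule(n_nodes):
    """Return, per iteration, the (left, right) node pairs that share a boundary."""
    assert n_nodes > 0 and n_nodes & (n_nodes - 1) == 0, "hypercube size is a power of two"
    schedule = []
    width = 1  # nodes per slice at the start of the current iteration
    while width < n_nodes:
        # The rightmost node of each left slice meets the leftmost node of the right slice.
        pairs = [(start + width - 1, start + width)
                 for start in range(0, n_nodes, 2 * width)]
        schedule.append(pairs)
        width *= 2
    return schedule
```

For a 4-node array this gives pairs (0, 1) and (2, 3) in the first iteration and (1, 2), that is b and c, in the second, so the loop runs log2(n_nodes) times.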


Figure 2. Conflict resolution occurring in each
iteration: (a) linear array of nodes,
(b) boundary pixel arrays, (c) resolution table

3.1  Boundary  Resolution  Algorithm 

In each iteration of the conflict resolution algorithm,
labels are resolved at the boundaries of each
pair of slices in the linear array of nodes as discussed
previously. As the boundary resolution occurs between
two of the nodes located at the boundaries of
two vertical strips, one of the nodes sends a boundary
pixel array to the other node, which is in turn responsible
for performing the boundary-overlap resolution
and distributing the results to all the nodes in the
two vertical strips. For example, consider the configuration
of the linear array of nodes in the second
iteration illustrated in figure 2a. Node b sends the
rightmost array of boundary pixels to node c. Node c
runs the boundary-overlap resolution algorithm and
passes the results to nodes a, b, and d.

The boundary-overlap resolution algorithm takes
as input two arrays of boundary pixel labels and returns
a resolution table for updating the boundary
pixels. An example of arrays of boundary pixel labels
used in the process of conflict resolution is illustrated
in figure 2b, where a 0 denotes a background
pixel. The resulting resolution table for this example
is shown in figure 2c.

We  present  two  different  approaches  for  solving 
the  problem  of  boundary-overlap  resolution. 

3.1.1  Initial  Approach 

Initially, we used a recursive algorithm for resolving
the inconsistencies in the labels assigned to
adjacent foreground pixels across the boundary (note
that adjacency is defined by 8-connectivity). The idea
is to take foreground pixels from one array and a table
T (initially empty) and perform the following for
each pixel x:

i) Let l be the label of x. If there is a (l, newl)
in T let l := newl.

ii) For each y an adjacent pixel of x:

a) Let k be the label of y. If there is a
(k, newk) in T let k := newk.

b) If k ≠ l, add (k, l) to T, and for all
foreground pixels having the same label
as y repeat step ii with x = y.

The  labels  of  the  foreground  pixels  change  as 
the  algorithm  proceeds.  Therefore  whenever  labels 
of  foreground  pixels  are  to  be  used,  there  is  a  search 
performed  on  the  label  in  the  resolution  table  to  check 
if  there  is  a  new  value  associated  with  it. 

Applying the above strategy to the arrays of figure
2b, we will have l = 5 for the third foreground
pixel from the top in the left array. The adjacent pixels
with inconsistent labels are both labeled 2. The set
of indices of all the pixels labeled 2 is {2, 3, 10}. For
each of the elements of the set, step ii is repeated with
l = 5.

Formulating the above into a procedure, we note that:

1) If labels are to be searched, the table must be
kept sorted by increasing order of l1 for each
(l1, l2) entry so that binary search can be applied.
table.

Given a resolution table T, procedure reset_paths
proceeds by sorting the labels in T in ascending order
and placing them in a forest F where each label is
placed at the root of a tree. Grouping of labels into
disjoint sets can then be achieved by a series of find
and union operations [2]. The end result is then obtained
by taking the root of each tree in the forest as
the representative of the tree and creating a new table
by pairing every label in each of the trees with its
root.


reset_paths

init_forest(F, T)
for each (l1, l2) ∈ T
    root1 := find(F, l1)
    root2 := find(F, l2)
    if root1 ≠ root2 then
        union(F, root1, root2)
end for

tempT := an empty table
for each tree S in F do
    root := rootof(F, S)
    for each l ∈ S do
        add (l, root) to tempT
    end for
end for
return tempT
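A compact Python rendering of the procedure above may make the find and union steps concrete. It is a sketch under assumptions, not the authors' code: the forest is a dictionary of parent pointers, find uses path compression, and the union rule of making the smaller label the root is a choice made here so that the representative is predictable; the paper does not say which root is kept.

```python
def reset_paths(table):
    """Resolve a table of label pairs into (label, representative) pairs."""
    # init_forest: every label starts as the root of its own tree.
    labels = sorted({l for pair in table for l in pair})
    parent = {l: l for l in labels}

    def find(l):
        root = l
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[l] != root:
            parent[l], l = root, parent[l]
        return root

    for l1, l2 in table:
        root1, root2 = find(l1), find(l2)
        if root1 != root2:
            # Union; keeping the smaller label as root is an assumption.
            parent[max(root1, root2)] = min(root1, root2)

    # Pair every label with the root of its tree.
    return [(l, find(l)) for l in labels]
```

For the table [(1, 2), (2, 3)] this returns [(1, 1), (2, 1), (3, 1)], which is the (l1, l1), (l2, l1), (l3, l1) form described at the start of this section.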


5.  Load  Balancing 

In order to balance the load on all the processors,
the process of boundary resolution between two vertical
strips performed by a single processor, discussed
in section 3.1, can be divided among all the nodes
within each vertical strip (figure 2a). By allowing every
node in a group of nodes in a vertical strip to
perform boundary resolution on a subdivision of the
two boundary pixel arrays, intermediary resolution
tables can be formed which are then merged together
to produce a final resolution table. The final resolution
table is then distributed among the nodes in the
group.

As an example consider the configuration shown
in figure 2a, where boundary resolution is occurring
between nodes b and c in the second iteration of the
algorithm. We divide the two arrays of boundary pixels
by the number of nodes and distribute them as in
figure 4. The subparts of the boundary arrays have
a pixel-long overlapping region (i.e. the last entry in
the arrays sent to one node is the same as the first entry
of the arrays in the neighboring node) so that the
diagonally adjacent pixels at the end of each subarray
are considered in the process of boundary resolution.
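The subdivision with a one-pixel overlap can be sketched as below. This is an illustration under assumptions: the name `split_with_overlap` is hypothetical, the arrays are Python lists of equal length, and the split points are chosen by integer division, which the paper does not specify.

```python
def split_with_overlap(col1, col2, n_parts):
    """Split two boundary arrays into n_parts pieces that overlap by one entry."""
    n = len(col1)
    assert len(col2) == n and 0 < n_parts <= n
    bounds = [k * n // n_parts for k in range(n_parts + 1)]
    parts = []
    for k in range(n_parts):
        lo = bounds[k]
        # Include one extra entry: the first entry of the next piece.
        hi = min(bounds[k + 1] + 1, n)
        parts.append((col1[lo:hi], col2[lo:hi]))
    return parts
```

With the shared entry, a pixel at the end of one piece and its diagonal neighbour at the start of the next piece always appear together in at least one piece, so no adjacency across the boundary is missed.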


Figure  4.  Dividing  the  boundary  pixel  arrays 
into  subparts 


6.  Timing  Results 
Figure 5. The images for which timing results were
measured.

Figure 5 illustrates the images for which the timing
results are given in this section. The images are
512x512 matrices of black and white pixels. The algorithms
are examined on an Intel iPSC/2 Hypercube.

The runtime of the sequential algorithm depends
on the number of pixels in the image and on the
pattern of the image [3]. Therefore the nodes with
subimages that have many partial segments of connected
components will have a higher load, as there
are more entries in the resolution table to be resolved
by the path resetting algorithm. In the case of images
such as the ones illustrated in figures 5a and 5c, the
nodes will have the same amount of load due to the
symmetry of the images, except for the leftmost and
rightmost nodes in the linear array of nodes.


First Algorithm

Picture    Elapsed Time (ms)

Snake      1546
Emblem     2962
Spiral     1859
ID         3215

Table 2. The execution time of the initial algorithm
on 32 nodes.

Tables 2, 3 and 4 provide the elapsed execution
time of the code consisting of the sequential and the
conflict resolution algorithms.

The results of the alternative algorithm for conflict
resolution show a considerable improvement over
the initial algorithm, especially in the case of Emblem
and ID, where the elapsed time using the alternative
algorithm is less than half of that of the initial algorithm.
The disadvantage of the initial algorithm is
due to the fact that for every new pair (l1, l2) added
to the resolution table, all (l3, l1) entries added in the
preceding recursion levels must be reset to (l3, l2),
whereas in the case of the alternative algorithm resetting
a set of previously resolved labels consists of performing
a union operation on two disjoint sets, which
has a running time of O(1).


Second Algorithm

Picture    Elapsed Time (ms)

Snake      1078
Emblem     1441
Spiral     1061
ID         1417

Table 3. The execution time of the alternative algorithm
on 32 nodes.

There is a certain amount of overhead involved
with the load balancing technique we have used. As
the boundary-overlap resolution algorithm is applied
to each node in a group of nodes in a vertical slice,
a resolution table is produced. These intermediary
resolution tables must be merged to produce a final
resolution table. Distributing the boundary pixel arrays
and merging the intermediary resolution tables are
the major overhead factors responsible for the
nonlinearity of boundary-overlap resolution running time
with respect to the number of nodes partaking in the
process of boundary resolution.


Load Balancing

Picture     Elapsed Time (ms)
Snake       1042
Emblem      1356
Spiral      1017
ID          1370

Table 4. The execution time of the final algorithm
with load balancing on 32 nodes.

7.  Future  Research 

Although  the  load  balancing  technique  we  have 
introduced  does  show  an  improvement  to  the  conflict 
resolution  algorithm,  the  overhead  is  still  considerable 
and  as  the  number  of  nodes  increases  the  elapsed  time 
for  the  conflict  resolution  becomes  comparable  with 
the  execution  time  of  the  sequential  algorithm.  We 
are  currently  concentrating  on  reducing  the  overhead 
associated  with  the  load  balancing  technique. 
Abstract

The digital halftone resolution problem may be stated
as follows: given an n x n array V of real numbers,
$V_{i,j} \in [0, 1]$, produce an n x n array w of binary
integers, $w_{i,j} \in \{0, 1\}$, such that w, when displayed on a
binary output device such as a computer monitor or laser
printer, is a “good” representation of the real information,
the intensities contained in V. Standard halftone
resolution algorithms, such as ordered dither, often mask
specular information contained in image data. A new
algorithm, based on feedback neural networks, is described
in detail and shown to provide an enhanced specular
information display. Three parallel implementations and
one parallel/vector implementation on an Intel iPSC/2
hypercube are described. The programs are run on two
images, one of size (256 x 256), and the other of size
(1024 x 1024). Parallelism results in a speedup of 7.6,
with an efficiency of 95%, using eight processors.
Vectorization provides a time improvement of approximately
2.5 over nonvector implementations.

keywords:  digital  halftone,  neural  networks,  fixed- 
point  iteration,  parallel  simulation,  iPSC/2  hypercube 

1.  Introduction. 

The digital halftone resolution problem may be stated
as follows: given an n x n array V of real numbers,
$V_{i,j} \in [0, 1]$, produce an n x n array w of binary
integers, $w_{i,j} \in \{0, 1\}$, such that w, when displayed on
a binary output device such as a computer monitor or
laser printer, is a “good” representation of the real
information, the intensities contained in V. The obvious
resolution algorithm, round the values in V, fails to
satisfy most interpretations of “good”. For instance, if
$V_{i,j} = .4999999$ for all i, j, then $w_{i,j} = 0$ for all i, j, and a
desired gray image is displayed as white. Consideration
of neighborhood intensities seems imperative.

Many halftone resolution algorithms have been
proposed (see [8]). The most commonly used is probably
the ordered dither [2], in which we tile the image matrix
V with a smaller fixed array D of threshold values, and
then turn on the pixel (set $w_{i,j} = 1$) if and only if $V_{i,j}$
exceeds the corresponding threshold value. A standard
4x4 tile is

$$D = \begin{pmatrix}
1/32 & 17/32 & 5/32 & 21/32 \\
25/32 & 9/32 & 29/32 & 13/32 \\
7/32 & 23/32 & 3/32 & 19/32 \\
31/32 & 15/32 & 27/32 & 11/32
\end{pmatrix}$$

Note that a uniform intensity of 0.5 would cause 8 of
every 16 pixels (every other one) to be turned on.
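The thresholding rule can be written out in a few lines. The following is a minimal sketch using the tile D quoted above; the function names `ordered_dither` and `round_resolution` are hypothetical, and the rounding convention (values of 0.5 and above round up) is an assumption.

```python
from fractions import Fraction

# The 4x4 threshold tile D from the text, entered in units of 1/32.
D = [[Fraction(k, 32) for k in row] for row in (
    (1, 17, 5, 21),
    (25, 9, 29, 13),
    (7, 23, 3, 19),
    (31, 15, 27, 11),
)]


def ordered_dither(V):
    """Set w[i][j] = 1 iff V[i][j] exceeds the threshold of the tiled D."""
    return [[1 if V[i][j] > D[i % 4][j % 4] else 0
             for j in range(len(V[i]))]
            for i in range(len(V))]


def round_resolution(V):
    """The 'obvious' algorithm: round each intensity independently."""
    return [[1 if v >= 0.5 else 0 for v in row] for row in V]
```

On a uniform image of intensity 0.5 the dither turns on exactly the eight pixels whose thresholds lie below 0.5, which form a checkerboard, whereas plain rounding of a uniform 0.4999999 image turns on none.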

In  Figure  1  we  show  a  1024x  1024  pixel  image  of  a 
ray-traced  scene  containing  two  spheres,  a  checked  floor, 
three  walls,  and  a  rectangular  mirror  (on  the  back  wall). 
This  image  was  resolved  using  ordered  dither  D,  and 
was  printed  on  a  conventional  300  pixel  per  inch  laser 
printer. 

Although  this  image  offers  reasonable  shading,  we 
contend  that  the  ordered  dither,  as  well  as  other  com¬ 
monly  used  halftoning  algorithms,  can  mask  much  of  the 
specular  information  available  in  the  data.  For  this  rea¬ 
son  we  have  developed  an  alternative  algorithm  whose 
primary  purpose  is  enhanced  specular  information  dis¬ 
play. 

In  section  2  we  provide  the  theoretical  founda¬ 
tions  of  this  algorithm,  and  in  section  3  we  describe  its 
implementation  on  an  Intel  iPSC/2  hypercube.  Section 
4  contains  conclusions  and  current  directions. 

2.  Algorithm  Design. 

Our algorithm is based on feedback neural networks [4, 6].
A neural network is a collection of simple analog processing
elements designed to mimic biological neurons. The
computational  paradigm  provided  by  such  networks  is 
a  radical  departure  from  that  of  the  classical  von  Neu¬ 
mann  architecture.  The  “input”  to  such  a  network  is 
a  matrix  of  interconnections  among  the  processing  ele¬ 
ments,  together  with  an  initial  voltage  that  is  applied 
to  each  element.  The  networks  are  designed  so  that  the 
stable  output  voltages  of  the  analog  elements  are  binary. 
The  collection  of  all  binary  output  levels  is  then  inter¬ 
preted  as  the  “result”  of  the  computation. 

Figure 1: Ray-traced image resolved by ordered dither.

A four element example, from [9], is shown in
Figure 2. Neurons are represented by amplifiers, each
providing both standard and inverted outputs (voltage
$\theta_i \in [-1, 1]$). Synapses are represented by the physical
connections between input lines to the amplifiers and, in
feedback, output lines from the amplifiers. Resistors are
used to make these connections. If the input to amplifier
i is connected to the output of amplifier j by a resistor
with value $R_{ij}$, then the conductance of the connection
is $T_{ij}$, whose magnitude is $1/R_{ij}$ and whose sign is
determined by whether the connection to amplifier j is from
the standard or inverted output. Hopfield [6] showed
that when the matrix T is symmetric with zero diagonal
and the amplifiers are operated in “high-gain” mode, the
stable states of the network are binary ({-1,1}) and are
the local minima of the computational energy,

$$E(\theta) = -\frac{1}{2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} T_{ij}\,\theta_i\,\theta_j - \sum_{i=0}^{N-1} I_i\,\theta_i . \qquad (1)$$

Here $I_i$ is the external input to amplifier i.

Such  neural  networks  offer  a  natural  representa¬ 
tion  of  a  binary  display.  Each  pixel  is  represented  by  a 
neuron  that  is  connected  to  and  influenced  by  its  neigh¬ 
boring  pixels  (neurons).  A  network  binary  stable  state 
is  then  a  halftone  resolution. 
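To make the role of the energy concrete, the following minimal sketch evaluates (1) for a given binary state. The function name `energy` and the two-neuron example network are illustrative assumptions, not taken from the paper.

```python
def energy(T, I, theta):
    """Computational energy (1) for connection matrix T, inputs I, state theta."""
    n = len(theta)
    pair_term = sum(T[i][j] * theta[i] * theta[j]
                    for i in range(n) for j in range(n))
    input_term = sum(I[i] * theta[i] for i in range(n))
    return -0.5 * pair_term - input_term


# Hypothetical two-pixel network with a negative (inhibitory) connection.
T_EXAMPLE = [[0, -1], [-1, 0]]
I_EXAMPLE = [0, 0]
```

With a negative connection between the two neurons, the state in which they disagree, `[1, -1]`, has lower energy than the state in which they agree, `[1, 1]`, so neighboring pixels are pushed toward opposite values.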

Several authors have considered applications of
neural nets to digital halftoning [1, 3], but difficulties
remain. First, there is no natural mapping from the
desired image intensities, the V matrix, to the network
parameters, the $T_{ij}$'s and the $I_i$'s. Reasonable choices
abound. In our implementation we have selected a simple
but easily motivated specification for these values. If
we scale the intensities over [-1,1] by letting $v_i = 2V_i - 1$,
then our choice is given by

$$I_i = v_i - C \sum_{j \in \mathrm{nhbd}(i)} \cdots$$

$$T_{ij} = -K_{ij}\,(2 - |v_i + v_j|)$$

where C and $K_{ij}$ are non-negative constants. The term
$K_{ij}$ depends only upon the mod 4 row and column numbers
of pixels i and j, and is symmetric in i and j.

Note that as $v_i$ (and hence $I_i$) becomes larger,
it becomes more important to turn that pixel on (set
$\theta_i = 1$) to reduce E in (1). However, there are some
attenuating factors. From (1) we see that $(-1/2)\,T_{ij}$
can be viewed as the strength with which we insist that
adjacent pixels assume opposite parity, and this is at a
maximum when the underlying intensities are of equal
magnitude and opposite sign, e.g. one black and one
white ($v_i = 1$, $v_j = -1$) or both gray ($v_i = v_j = 0$).

We should also note that some control over average
region intensity is at our disposal. If we let
$m = \left[ \sum_{i \in R} V_i \right]$ denote the rounded total intensity over
region R, then we can add to $E(\theta)$ a summand of the
form

$$C \left( \sum_{i \in R} \frac{\theta_i + 1}{2} - m \right)^2$$

where C > 0, and, to maintain a zero-diagonal T matrix,
another of the form

$$-\frac{C}{4} \sum_{i \in R} \theta_i^2 .$$

The result is equivalent to adding $C(m - |R|/2)$ to each
$I_i$ and $-C/2$ to each $T_{ij}$ $(i, j \in R)$, and the net effect
is to force m of |R| pixels on. Applied on a global scale,
this modification lends force to providing a resolution
with correct average intensity.

Figure 2: Four element feedback network.
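The stated equivalence can be checked by expanding the square. The derivation below is a sketch under one assumption: that the diagonal-cancelling summand is $-(C/4)\sum_{i \in R} \theta_i^2$, which is the form the stated adjustments to $I_i$ and $T_{ij}$ require. The shorthand $S = \sum_{i \in R} \theta_i$ and $n = |R|$ is introduced here for convenience.

```latex
\begin{align*}
C\Bigl(\sum_{i \in R} \tfrac{\theta_i + 1}{2} - m\Bigr)^2
  &= C\Bigl(\tfrac{S}{2} + \tfrac{n}{2} - m\Bigr)^2 \\
  &= \tfrac{C}{4}\,S^2 + C\bigl(\tfrac{n}{2} - m\bigr)\,S
     + C\bigl(\tfrac{n}{2} - m\bigr)^2 , \\
S^2 &= \sum_{i \in R} \theta_i^2
     + \sum_{\substack{i, j \in R \\ i \neq j}} \theta_i \theta_j .
\end{align*}
% Subtracting (C/4) \sum_{i \in R} \theta_i^2 removes the diagonal terms, leaving
\[
\tfrac{C}{4} \sum_{\substack{i, j \in R \\ i \neq j}} \theta_i \theta_j
  \;-\; C\bigl(m - \tfrac{n}{2}\bigr) \sum_{i \in R} \theta_i
  \;+\; \text{const} .
\]
% Matching against E = -(1/2) \sum T_{ij} \theta_i \theta_j - \sum I_i \theta_i:
\[
-\tfrac{1}{2}\,\Delta T_{ij} = \tfrac{C}{4}
  \;\Rightarrow\; \Delta T_{ij} = -\tfrac{C}{2},
\qquad
\Delta I_i = C\bigl(m - \tfrac{n}{2}\bigr).
\]
```

The constant term does not depend on the state and so does not affect the location of the minima.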

A second difficulty we face is that networks of
$N = 1024^2 = 1{,}048{,}576$ neurons have not yet been built,
and we must resort to net simulation, which can require
excessive memory space and execution time.

Net simulation is traditionally approached (e.g.
[9]) as a numerical integration of the system of N
differential equations describing the operation of the
amplifiers [6]:

$$C_i \frac{du_i}{dt} = \sum_j T_{ij}\, g(u_j) - \frac{u_i}{R_i} + I_i . \qquad (2)$$

Here the $u_i$ are internal input voltages to the amplifiers,
and are related to the desired output voltages, the $\theta_i$,
by a sigmoidal gain function, g(x). A reasonable choice
for g(x) is a scaled hyperbolic tangent, $g(x) = \tanh(\lambda x)$.
Here $\lambda$ is called the gain. The $C_i$ are the input
capacitances of the amplifiers, and $R_i = 1/(1/\rho + \sum_j |T_{ij}|)$,
where $\rho$ is amplifier input resistance.

We have found numerical integration of large ($2^{20}$
neuron) systems of the form (2) to be extremely time
consuming, and therefore have developed an alternative
approach. Any equilibrium of (2) is given by

$$0 = \sum_j T_{ij}\, g(u_j) - \frac{u_i}{R_i} + I_i ,$$

that is,

$$u_i = R_i \Bigl( \sum_j T_{ij}\, g(u_j) + I_i \Bigr),$$

or simply

$$u = G(u),$$

where $G(u) = \mathrm{diag}(R)\,(T g(u) + I)$, diag(R) has
$R_i$'s on the diagonal and 0's elsewhere, and $g(u) =
(g(u_1), g(u_2), \ldots)$. Thus we seek a fixed point of a
certain N-dimensional function. If $|\cdot|$ denotes the max
norm on Euclidean N-space and $\|\cdot\|$ its induced matrix
norm, then since $R_i < 1/\sum_j |T_{ij}|$ we have

$$\begin{aligned}
|G(u) - G(u')| &= |\mathrm{diag}(R)\, T\, (g(u) - g(u'))| \\
 &\le \|\mathrm{diag}(R)\, T\| \cdot |g(u) - g(u')| \\
 &\le |g(u) - g(u')| \\
 &\le \lambda\, |u - u'| ,
\end{aligned}$$

where the last inequality follows from a Taylor expansion
and bounded derivative of g. Thus, for gain $\lambda < 1$,
convergence of the simple iteration scheme, $u_{n+1} = G(u_n)$, is
straightforward (see, e.g. [7]). Unfortunately, the Hopfield
result speaks only of high-gain operation, and we
must consider $\lambda > 1$, where the simple iteration is likely
to diverge. Fortunately, there is an intriguing alternative.

In [5] Hillam established a remarkable result for
functions on the real line: if $f : [a, b] \to [a, b]$ satisfies
$|f(x) - f(y)| \le M |x - y|$, then the iteration scheme

$$x_{n+1} = \frac{1}{M+1}\, f(x_n) + \frac{M}{M+1}\, x_n \qquad (3)$$

converges to a fixed point of f. To our knowledge, the
conjecture that this result extends to higher dimensions
remains unresolved.

Nevertheless, we have found substantial empirical
evidence to support it. Using (3) with $M = \lambda$, we find
that convergence to a fixed point of G (that is, average
component error $|u_i - G(u)_i| < 10^{\ldots}$) usually requires
fewer than 200 iterations. We have not found a net for
which this scheme fails to converge.
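The effect of the averaging in (3) is easiest to see in the one-dimensional setting where the cited result applies. The sketch below is an illustration only: the test function $f(x) = \tanh(-\lambda x)$ on $[-1, 1]$ with an assumed gain of 3, and the names `damped_iterate` and `simple_iterate`, are not from the paper. This f has Lipschitz constant $M = \lambda = 3$ and its only fixed point is 0.

```python
import math

GAIN = 3.0  # assumed high gain (lambda > 1), chosen for illustration


def f(x):
    """Test map on [-1, 1] with Lipschitz constant GAIN and fixed point 0."""
    return math.tanh(-GAIN * x)


def simple_iterate(func, x0, iterations):
    """Plain fixed-point iteration x_{n+1} = func(x_n)."""
    x = x0
    for _ in range(iterations):
        x = func(x)
    return x


def damped_iterate(func, x0, M, iterations):
    """Scheme (3): x_{n+1} = func(x_n) / (M + 1) + M * x_n / (M + 1)."""
    x = x0
    for _ in range(iterations):
        x = (func(x) + M * x) / (M + 1.0)
    return x
```

Started from 0.5, the plain iteration is repelled from the fixed point and settles into an oscillation between two values near ±1, while the averaged scheme with M equal to the gain converges to 0. The simulation described in the text applies the same averaging component-wise to the vector map G, where convergence is observed empirically rather than proven.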

In  Figure  3  we  show  the  results  of  application  of 
this  algorithm  to  the  same  data  used  in  Figure  1.  We 
note  the  substantial  addition  of  specular  information. 
Spheres  contain  reflected  images  of  the  floor,  the  side 
walls  (including  the  extent  of  the  walls),  and  even  each 
other.  The  walls  also  contain  marked  reflections  of  the 
floor  and  the  mirror. 

3.  Implementation. 

A  sequential  iteration  on  a  floating  point  vector,  u,  of 
length  1,048,576  can  represent  an  enormous  computa¬ 
tional  expense,  and  a  parallel  implementation  is  highly 
desirable.  We  implemented  our  net  simulation  algorithm 
on  an  Intel  iPSC/2  hypercube  with  16  nodes,  each  with 
Intel  80386/80387  scalar  processors  and  8  megabytes  of 
memory.  Eight  of  the  hypercube  nodes  have  vector  pro¬ 
cessors  and  an  additional  megabyte  of  vector  memory. 

The program was written in C and run on one image
of size 256 x 256 (Figure 4) and one of size 1024 x 1024
(Figure 3). Timing results of these runs are summarized
in Table 1.

Five versions of the algorithm were developed: a
sequential version (S), a parallel version (P1) running
on eight scalar nodes, a second parallel version (P2) on
eight nodes, a parallel version (P3) on sixteen nodes, and
a parallel/vector version (PV) running on eight vector
nodes. The large (1024 x 1024) image (Figure 3) was
produced using P3. All other programs used the
smaller (256 x 256) image (Figure 4) for data. For P1,
P2, PV, and P3, input data are distributed evenly among
eight or sixteen nodes, each node receiving an equal number
of rows of input data. Program S, of course, holds all
the data in a single node’s memory.

Programs S and P1 were versions designed to conserve
memory. Hence, the array containing the pixel
intensities was updated in place, and functions (such
as g(u)) were reevaluated whenever needed (rather than
saved in temporary arrays). There are two advantages to
this approach: (a) less memory is used since additional
arrays to hold new values temporarily are not required,
and (b) the method converges with fewer iterations since
new updated values are used immediately. Two
disadvantages, however, are: (a) vectorization of the operations
is difficult, and (b) the algorithm requires multiple
evaluations of the (computationally costly) function g(u).

Programs P2 and PV update each pixel’s intensity
using only old values of its neighbors’ intensities.
Separate arrays hold the new u and g(u) values. This
requires more memory per node and more iterations
(compare P1 and P2). However, because g(u) is evaluated
only once for each new pixel intensity u, total number
of operations and overall execution time are less for P2.
PV is a vectorized version of P2. Additional memory is
used by PV to make vectorization more efficient.
Vectorization is possible because of the use of only old values
of u to compute new values.
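The trade-off between the two update styles can be sketched on a toy network. Everything in this example is an assumption made for illustration: the four-neuron ring, the parameter values, the class name `Net`, and the use of the plain iteration $u \leftarrow G(u)$ with gain below 1 (so that convergence is guaranteed by the contraction argument) rather than the averaged scheme used in the actual programs.

```python
import math


def make_ring(n, weight):
    """Hypothetical test network: each neuron is tied to its two ring neighbours."""
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        T[i][(i - 1) % n] = weight
        T[i][(i + 1) % n] = weight
    return T


class Net:
    def __init__(self, T, I, rho, gain):
        self.T, self.I, self.gain = T, I, gain
        # R_i = 1 / (1/rho + sum_j |T_ij|)
        self.R = [1.0 / (1.0 / rho + sum(abs(t) for t in row)) for row in T]
        self.nbrs = [[j for j, t in enumerate(row) if t != 0.0] for row in T]
        self.g_calls = 0  # counts evaluations of the costly gain function

    def g(self, x):
        self.g_calls += 1
        return math.tanh(self.gain * x)

    def sweep_in_place(self, u):
        """P1 style: overwrite u[i] at once; g is re-evaluated at every use."""
        for i in range(len(u)):
            s = sum(self.T[i][j] * self.g(u[j]) for j in self.nbrs[i])
            u[i] = self.R[i] * (s + self.I[i])
        return u

    def sweep_old_values(self, u):
        """P2 style: evaluate g once per neuron, then build a new vector."""
        gu = [self.g(x) for x in u]
        return [self.R[i] * (sum(self.T[i][j] * gu[j] for j in self.nbrs[i])
                             + self.I[i])
                for i in range(len(u))]


def compare(sweeps=60):
    """Run both styles on the same toy network and report states and g counts."""
    T = make_ring(4, -0.5)
    I = [0.3, -0.2, 0.1, 0.4]
    a = Net(T, I, rho=1.0, gain=0.5)
    b = Net(T, I, rho=1.0, gain=0.5)
    ua, ub = [0.0] * 4, [0.0] * 4
    for _ in range(sweeps):
        ua = a.sweep_in_place(ua)
        ub = b.sweep_old_values(ub)
    return ua, ub, a.g_calls, b.g_calls
```

Both styles reach the same fixed point, but the in-place version calls g once per neighbor per update, while the old-values version calls it once per neuron per sweep at the cost of an extra array. The second form is also the one whose inner loop has no dependence between elements, which is what makes vectorization possible.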

Note that each complete intermediate iterate, $u_k$,
can be regarded as a halftone resolution and displayed,
thus allowing us to observe the convergence.
In Figure 4 we show intermediate resolutions of the
256 x 256 image (a digitized photograph) at iterations
k = 4, 8, 12, 16, 24, 28, 32, 50, 100.

The ratio of execution times of versions S and
P1 shows a speedup, due to eight-way parallelism, of
6080/800 = 7.6, with an efficiency of 7.6/8 = 95%.
Comparing PV and P2, the execution times show that
the improvement factor due to vectorization is
approximately 289/114 = 2.5.

Table 1, column P3, gives the time required to
generate the 1024 x 1024 pixel image shown in Figure
3. The method, like P2 and PV, computes the new u
matrix using old u values. This version is not vectorized
because vector processors were available only on eight of the
sixteen nodes of the iPSC/2 used. It required approximately
176 iterations taking 2579 seconds, or 43.3 minutes,
of computation, and 110 Mbytes of memory. Had
vector processors been available on all nodes, we estimate that a
vectorized version would improve by a factor of 2.5, or
require 2579/2.5 ≈ 1032 seconds or 17.3 minutes. To
the best of our knowledge, Figure 3 is the only neural
network-generated image of this size.
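The speedup, efficiency and MFLOPS figures follow directly from the execution times and operation counts in Table 1. The short sketch below simply redoes that arithmetic; the dictionary and function names are hypothetical.

```python
# Figures copied from Table 1: execution time in seconds and
# operation count in millions, keyed by program version.
TIME = {"S": 6080, "P1": 800, "P2": 289, "PV": 114, "P3": 2579}
MOPS = {"S": 641, "P1": 641, "P2": 319, "PV": 319, "P3": 5810}


def speedup(base, other):
    """Ratio of execution times, base version over other version."""
    return TIME[base] / TIME[other]


def efficiency(base, other, nodes):
    """Speedup divided by the number of nodes used by the other version."""
    return speedup(base, other) / nodes


def mflops(version):
    """Millions of operations per second for one version."""
    return MOPS[version] / TIME[version]
```

`speedup("S", "P1")` gives 7.6 and `efficiency("S", "P1", 8)` gives 0.95, matching the text, and `speedup("P2", "PV")` is about 2.54. The computed MFLOPS for P2 and PV come out marginally below the tabulated values, presumably because the operation counts in the table are rounded.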

4.  Future  Work. 

We are currently working on a vectorized version of P3.
We are modifying program PV in an attempt to reduce
memory requirements. The goal is to fit the data into
the memory available on eight vector nodes. A
parallel/vector version should reduce the execution time by a
factor of ≈ 2.5, i.e., down to approximately 34.6 minutes
using eight vector nodes. The ability to view a new image
every 35 minutes should greatly facilitate the user’s
quest for the parameters that generate the most accurate
images. We also continue to experiment with different
parameters K and C, to discover relationships between
the parameters and the quality of specular information
obtained. Finally, we are currently experimenting with
variations of the neural network algorithm used.
Figure 3: Ray-traced image with enhanced specular information.


Version                               S       P1      P2      PV      P3
Number of Nodes                       1       8       8       8       16
Image Size (N)                        256     256     256     256     1024
Execution Time (secs)                 6080    800     289     114     2579
Memory Required (Kbytes per node)     5720    755     892     960     6877
Number of Iterations                  133     134     154     154     176
Number of Operations (millions)       641     641     319     319     5810
MFLOPS                                0.105   0.801   1.105   2.809   2.253

Table 1: Summary of results
Introduction

The tremendous amount of data contained in an image
oftentimes precludes the extraction of useful information
in real-time environments. A multiresolution
representation can be used to obtain structural
properties of a single image or sequences of images [8].
These structural properties are useful for such operations
as texture analysis, image segmentation, object
identification and stereo matching [7,12,11].

A number of methods have been suggested to obtain
a multiresolution representation of images. The
pyramidal representation of Burt [1] and Crowley [3],
the morphological filter approach of Chen and Yan [2],
the multiscale Gaussian filter of Marr and Poggio [10],
the subband coding technique of Woods and O’Neil [14],
and the wavelet representation of Mallat [9] are some
examples. The wavelet representation offers the best
basis for image analysis, since the orthonormal basis
guarantees no correlation between image details sampled
at different scales. Correlation found in the Burt
and Crowley pyramid structure and the Chen and Yan
morphological space hampers pattern recognition
operations due to difficulties in selection of appropriate
distance metrics. In addition, the wavelet representation
has good localization properties in the Fourier
and spatial domains [9].

This  paper  presents  a  hypercube  algorithm  for  gen¬ 
erating  the  wavelet  representation  of  an  image  or  se¬ 
quence  of  images.  All  of  the  approximation  and  detail 
images  of  the  representation  can  be  obtained  simulta¬ 
neously.  An  experimental  study  is  performed  on  im¬ 
ages  using  the  1024  element  NCUBE  hypercube  at  the 
University  of  South  Carolina.  The  next  section  gives 
a  description  of  the  development  of  the  orthonormal 
wavelet  representation  for  images.  This  is  followed  by 
details of the hypercube implementation. Finally, the
results of some experimental studies are shown.


Wavelet  Representation 

An image can be considered to be a member of the
function space $L^2(R^n)$. Expansions of $L^2(R^n)$ functions
are obtained from translations and dilations of
the wavelet functions $\psi(x)$ [9], where n is the
dimensionality of the vector space. The multiresolution
approximation of an image is given by $A_{2^j} f(x, y)$, where
$j < 1$ and $f(x, y)$ is the two-dimensional gray-scale
image. The operator $A_{2^j}$ projects the image on a vector
space $V_{2^j} \subset L^2(R^n)$. This projection space has
the properties of causality, translation invariance and
convergence to the original signal as $j \to +\infty$.

The orthonormal basis of $V_{2^j}$ is derived from a
scaling function $\Phi(x, y)$ whose Fourier transform is a
low pass filter for images. This means that the operator
$A_{2^j}$ is equivalent to a low pass filtering of the image
followed by a uniform sampling at the resolution $2^j$. If
the Fourier transform of the scaling function is given
by:

$$\hat{\Phi}(\omega, \nu) = \prod_{p=1}^{+\infty} \prod_{q=1}^{+\infty} H(2^{-p}\omega,\, 2^{-q}\nu),$$

where

$$H(\omega, \nu) = \sum_{n=-\infty}^{+\infty} \sum_{m=-\infty}^{+\infty} \cdots$$

then $A_{2^j} f(x, y)$ can be found by convolving the image
with H and keeping every other row or column
in a two pass algorithm. The convolution filter H is
separable, which makes it suitable for a parallel
implementation. Since the function is separable, $\Phi(x, y)$ can
be decomposed as:

$$\Phi(x, y) = \phi(x)\,\phi(y).$$

The discrete spatial form of the $\phi(x)$ scaling function
can be found using a truncated series approximation.
If the series is truncated at n = 4, a convolution
filter of size 1x9 is generated. This filter is shown in
Figure 1. An image decomposition on the hypercube
will have to pass a constant number of four rows across
processor boundaries in order to apply this filter.

Figure 1: $\phi(x)$ scaling function (horizontal axis: position).


The difference in information between the image
at resolution $2^{j+1}$ and $2^j$ is called the detail signal.
This detail signal is given by the orthogonal projection
of the original signal on the orthogonal complement
of $V_{2^j}$ in $V_{2^{j+1}}$. The orthonormal basis for the
detail signal is given by $O_{2^j}$ and is generated by scaling
the wavelet $\Psi(x, y)$ by $2^j$ and translating to a grid
with a proportional spacing of $2^{-j}$. The detail signal
$D_{2^j} f(x, y)$ is obtained by convolving the image with
G, where G is the mirror filter of H, followed by keeping
every other row or column once again in a two pass
algorithm. The mirror filter basis is related through
the expansion coefficients of H as

$$g(n) = (-1)^{1-n}\, h(1 - n).$$

The wavelet associated with G is generated using:

$$\hat{\psi}(\omega) = G\!\left(\frac{\omega}{2}\right) \hat{\phi}\!\left(\frac{\omega}{2}\right).$$

Application of this wavelet to an image is equivalent to
a band pass filter centered at the origin which passes
signals in the frequency range of $-2\pi$ to $-\pi$ and $\pi$ to
$2\pi$. The spatial version of this wavelet, $\psi(x)$, is shown
in Figure 2. The original discrete image $f(x, y)$ can
be reconstructed using the information contained in
the approximation signal $A_{2^{-J}} f$ and the three detail
signals $D^1_{2^j} f$, $D^2_{2^j} f$ and $D^3_{2^j} f$, $-J \le j \le -1$.

Figure 2: $\psi(x)$ wavelet function (horizontal axis: position).
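The mirror-filter relation $g(n) = (-1)^{1-n} h(1-n)$ can be applied mechanically to any set of low-pass coefficients. The sketch below is an illustration under stated assumptions: the function name `mirror_filter` is hypothetical, and the two-tap Haar coefficients stand in for the paper's nine-tap filter, whose values are not given in the text.

```python
import math


def mirror_filter(h):
    """Given h as {n: h(n)}, return g with g(n) = (-1)**(1 - n) * h(1 - n)."""
    g = {}
    for k, coeff in h.items():
        n = 1 - k                       # h(k) supplies g(n), where 1 - n = k
        sign = -1 if (1 - n) % 2 else 1
        g[n] = sign * coeff
    return g


# Assumed stand-in filter (Haar); not the 1x9 filter used in the paper.
S = 1.0 / math.sqrt(2.0)
HAAR_H = {0: S, 1: S}
HAAR_G = mirror_filter(HAAR_H)
```

For the Haar pair this yields g(0) = -1/√2 and g(1) = +1/√2. The coefficients of g sum to zero, so G removes constant regions, and h and g are orthogonal to each other.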

The  next  section  contains  details  of  the  hypercube 
implementation  of  the  algorithm. 


Hypercube Wavelet Decomposition

The scaling function $\Phi(x, y)$ can be decomposed into
the product $\phi(x)\phi(y)$. The same operation can be
performed on the wavelets, except that instead of the single
wavelet of the one-dimensional case, there
are now three. These three are given by:

$$\Psi^1(x, y) = \phi(x)\,\psi(y)$$

$$\Psi^2(x, y) = \psi(x)\,\phi(y)$$

$$\Psi^3(x, y) = \psi(x)\,\psi(y).$$

This form of the wavelets leads to three detail signals
given by $D^1_{2^j} f(x, y)$, $D^2_{2^j} f(x, y)$ and $D^3_{2^j} f(x, y)$.
The horizontal, vertical and corner structure of the
image can be derived from these detail signals.

Since the wavelet transformation is separable,
operations on the rows can be done followed by
operations on the columns. In addition, the convolution
with H can be done in parallel with the convolution
using G. The following five step algorithm will
generate $A_{2^j} f$, $D^1_{2^j} f$, $D^2_{2^j} f$ and $D^3_{2^j} f$ from the original unscaled
image $A_{2^{j+1}} f(x, y)$.

HYPERCUBE WAVELET ALGORITHM

Step 1: Subdivide image into N/4 strips, where N is
the total number of processors. Distribute
these strips to the processors such that the
equivalent of four full images have been sent.

Step 2: Convolve the rows of two of the distributed
images with G, and the rows of the other two
distributed images with H.

Step 3: Sample resulting images by throwing out
every other column.

Step 4: Convolve the columns of two of the images
from Step 3 with G, and the columns of the
other two distributed images with H.

Step 5: Sample resulting images by throwing out
every other row.
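Steps 2 through 5 can be sketched for a single process as follows. This is not the hypercube implementation: the strip distribution of Step 1 and all inter-node communication are omitted, and the circular boundary handling, the Haar filter pair, and the function names are assumptions made for illustration. The four outputs are labelled by the filter applied to rows and then to columns, so "HH" is the approximation image and the other three are detail images.

```python
import math


def convolve_rows(image, taps):
    """Circularly convolve each row with taps given as {offset: coefficient}."""
    width = len(image[0])
    return [[sum(c * row[(x - k) % width] for k, c in taps.items())
             for x in range(width)]
            for row in image]


def transpose(image):
    return [list(col) for col in zip(*image)]


def drop_every_other_column(image):
    return [row[::2] for row in image]


def decompose(image, h, g):
    """Return the four half-resolution images keyed by row/column filter."""
    out = {}
    for row_name, row_taps in (("H", h), ("G", g)):
        # Steps 2 and 3: filter the rows, then discard every other column.
        rows_done = drop_every_other_column(convolve_rows(image, row_taps))
        for col_name, col_taps in (("H", h), ("G", g)):
            # Steps 4 and 5: filter the columns (rows of the transpose),
            # then discard every other row.
            cols_done = convolve_rows(transpose(rows_done), col_taps)
            out[row_name + col_name] = transpose(
                drop_every_other_column(cols_done))
    return out


# Assumed stand-in filters (Haar); the paper uses a 1x9 filter instead.
S = 1.0 / math.sqrt(2.0)
HAAR_H = {0: S, 1: S}
HAAR_G = {0: -S, 1: S}
```

Applied to a constant 4x4 image, the approximation image is constant and all three detail images are zero, as expected for filters whose high-pass coefficients sum to zero. Because the four row/column filter combinations are independent, they map naturally onto four groups of processors, which is the structure Steps 1 through 5 describe.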

The algorithm given above is optimal in the sense
of communication in that only shared rows/columns
need be communicated between nodes. A recent study
done by Jones et al. has indicated that a rectangular
decomposition with the number of rows a factor of
four times the number of columns is the optimal image
decomposition for the NCUBE system [6].


Experimental  Studies 

In order to test the efficiency of the hypercube mapping
we applied the algorithm to an image of the robot
test area at ORNL. The image resolution was 256 rows
by 256 columns with 256 levels of gray. The image
was mapped to the hypercube using a ring mapping
since at most only four rows of information needed to
be transferred between processors. If the filter size
is enlarged beyond 1x9, the added image information
would have to be transferred. The original image is
shown in Figure 3.

The  hypercube  algorithm  described  above  allows 
the  approximation  signal  and  all  of  the  detail  signals  to 
be  derived  simultaneously.  Results  of  the  application 
of  the  algorithm  to  the  image  in  Figure  3  are  shown 
in  Figure  4.  The  display  format  used  here  is  to  place 
the  approximation  image  in  the  upper  left  hand  win¬ 
dow,  horizontal  component  image  in  the  upper  right 
window,  vertical  component  image  in  the  lower  left 
window  and  the  corner  component  image  in  the  lower 
right  hand  window  in  the  Figure.  Details  of  the  hori¬ 
zontal  and  vertical  edges  in  the  machine  cabinets  are 
preserved,  while  at  the  same  time  the  diagonal  stripes 
on  the  floor  are  apparent  in  all  of  the  detail  images. 


Sensitivity  to  noise  is  evident  in  the  center  of  the  cor¬ 
ner  image  in  the  lower  right  hand  window  of  Figure 
4. 


Figure  3  Original  image 


Figure  4  Wavelet  decomposition 


Since the communication is of constant size,
efficiency of the algorithm should be almost constant.
Efficiency is plotted versus dimension of the subcube
for the image used in the study in Figure 5. Up to a
size seven subcube, communication will only be with
nearest neighbors in the ring. However, since each
processor only contains two rows in the size seven subcube
mapping, two transfers are necessary to get the
needed four rows of information. This effect is
apparent in the size seven subcube efficiency of 87%. The
total time taken for the wavelet transform at this
subcube size was 89.7 milliseconds, giving an equivalent
double precision floating point rate for all four of
the wavelet image generations of 54.8 Megaflops. This
timing benchmark includes the row transfers as well as the
type casting from char to double and double to char
necessary due to memory limitations on the nodes.

Figure 5: Algorithm performance (horizontal axis: subcube dimension).
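The drop in efficiency at dimension seven follows from simple row counting. The sketch below assumes, as the figure of two rows per processor at dimension seven suggests, that the 256 image rows are divided evenly over the $2^d$ nodes of a dimension-d subcube; the function names and the `halo_rows` parameter are hypothetical.

```python
def rows_per_node(image_rows, dimension):
    """Rows held by each node when rows are split evenly over 2**dimension nodes."""
    return image_rows // (2 ** dimension)


def transfers_needed(image_rows, dimension, halo_rows=4):
    """Neighbour transfers required to collect halo_rows rows from one side."""
    per_node = rows_per_node(image_rows, dimension)
    return -(-halo_rows // per_node)  # ceiling division
```

For the 256-row image, `transfers_needed(256, 6)` is 1 because each node holds four rows, while `transfers_needed(256, 7)` is 2 because each node holds only two, so the four rows needed by the 1x9 filter must come from two successive neighbors.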


Conclusions 

We  have  presented  a  parallel  version  of  the  wavelet  de¬ 
composition  algorithm  for  images.  The  algorithm  ex¬ 
ploits  the  SPMD  nature  of  the  NCUBE  hypercube  and 
allows  further  pattern  recognition  to  be  done  with  the 
horizontal,  vertical  and  corner  information  being  resi¬ 
dent  on  the  nodes.  Efficiency  of  the  algorithm  is  quite 
good  up  to  a  size  seven  subcube,  with  a  performance 
of  87%  at  that  dimension.  The  degradation  in  perfor¬ 
mance  at  this  resolution  would  be  offset  by  a  larger 
image,  since  more  rows  would  be  contained  within  the 
memory  of  a  single  processor.  Since  the  algorithm  was 
designed  for  SPMD  mode,  it  should  port  quite  easily 
to  a  SIMD  machine.  The  port  to  the  SIMD  MP-1 
machine  manufactured  by  MasPar  Corporation  was 
straightforward and the results of timing runs indicate
a performance of 108 double precision Megaflops
on a 4096 node version. The execution time was 31.7
milliseconds, which indicates that the algorithm can
be used for real-time image analysis.

We  are  presently  extending  the  analysis  to  a  three 
dimensional  wavelet  decomposition  of  a  sequence  of 
images,  where  the  third  dimension  is  time.  Motion 
analysis  using  region  features  has  been  previously  ex¬ 
amined  at  the  full  image  scale[13],  and  the  wavelet 
representation  offers  the  possibility  of  a  parallel  ex¬ 
amination  of  the  time  evolution  of  multiple  features 
derived  from  a  segmentation  analysis  using  a  paral¬ 
lel  self-organizing  feature  map[5].  In  addition,  Mallat 
has  shown  that  texture  analysis  can  be  done  with  the 
wavelet  representation  using  the  fractal  dimension  de¬ 
rived  from  the  power  function  spectra[9].  This  type 
of  analysis  can  be  merged  with  the  fractal  signature 
approach  [4]. 
Abstract

One important reason for the lack of acceptance of
coronary arteriography techniques is the approximately
four hours required for accurate 3D
quantification of an arterial tree, which makes the
technique prohibitive for the clinician to utilize. Parallel
processing techniques can greatly speed up the
image processing and analysis of 3D arterial trees.
It has been demonstrated in this project that the
reconstruction of the three dimensional image and
arteriographic measurements can be made close to
real time using these techniques.

1.  Problem  Description 

Coronary arteriography is currently the standard
technique for evaluating the condition of the
coronary arteries. This procedure determines not only
the need for revascularization[1], but
also the degree of success of surgery involving
angioplasty and bypass grafting. Currently,
coronary artery reconstruction is performed on
several sets of Digital Subtraction Angiography
images from different patients. In each case, the
images are obtained from a Siemens Angioscope D
interfaced to a Digitron II digital image acquisition
system. This system can currently acquire 512 by
512 pixel images at 30 frames per second
simultaneously with conventional cine film
acquisition. Reconstruction requires at least two
views of known orientation. For each of the
two-view images, the arteries are opacified with an
iodinated contrast medium as a sequence of x-ray
images is acquired.

The flow of information is shown in Figure 1.
The 3D reconstruction requires that the view
geometries be known as accurately as possible.
Corrections for geometric misalignment[2],
x-ray scattering, artifacts, etc. have to be made before
accurate view geometries are realized. The first step
in obtaining these geometries is segmentation of
the arteries. This requires an operator to specify the
branch points and other recognizable points, also
referred to as the node points. This interaction
establishes the original hierarchy and the
correspondence of node points between views. It
also divides the original arterial tree into smaller
artery segments or branches[4]. The next step is
the determination of the vessel centerline and edges of
each segment in each view, which proceeds without
operator intervention[6]. The 2D geometry
coordinates of each segment are mapped onto the
3D reconstructed geometry[3,5,7]. The 3D
reconstruction becomes more accurate as the
number of views of x-ray angiograms is increased.

The above-mentioned operations are computationally
very expensive. Moreover, as the
number of views increases, the time complexity of
the algorithm increases markedly. The
implementation of a multi-processor based
workstation utilizing parallel programming
techniques will not only reduce the time for
coronary reconstruction, but also make three-
and four-view reconstruction a reality for even more
accurate results.

2.  Description  of  Existing  Algorithms 

The existing sequential algorithms have been
implemented on a VAX-780 computer to provide
3D reconstruction of coronary arteries. Input
images are obtained from two-view ECG correlated
x-ray angiograms. A target structure consisting of
the node points is entered on the first of a sequence
of images[3,4] in one view using the Digitron II.
Automatic edge detection is used to detect the
centerline and edges of the vessels[7]. The edge
detection algorithm represents the computationally
intensive part of the software and is used many
times in the course of the reconstruction.
Angle-corrected densitometric calculations are used to
refine the vessel cross section[5]. 3D
reconstruction is completed by a distance
minimizing point matching technique.
Figure 1: Information Flow in 3-D Coronary
Artery Reconstruction


Based on an operator-entered target structure, vessel
centerlines and edges are detected in an automated
manner throughout the heart cycle in each view.
This centerline and edge information in 2D is used
to obtain the 3D representations of the arterial bed
throughout the coronary cycle. The extraction of
the image information into a sub-image is performed
orthogonal to the idealized search target such that
the rows of the extracted matrix consist of image
values along this orthogonal path. Edge features
are then enhanced using matched filter convolution
to create a likelihood matrix. The dynamic search
software module is then used to map the optimal
global path through this matrix. This technique is
applied on a segment or a portion of the entire
image and is repeated to process all the other
segments in the image. Typically each view will
have about 20 segments. The resultant data
structure consists of a set of 2D edges. The
densitometric cross section measurements are
obtained corresponding to each point along the
vessel segment.

The 3D reconstruction is obtained from each view
using geometry information of the 2D arterial tree
or plane tree as shown in Figure 1. For two or
more views the problem is reduced to finding the
intersection of the projection lines representing the
x-ray paths for each element. From the 3D centerline
the orientation of each vessel segment relative to
each projection is computed. The area of cross
section is computed for all elements of all
segments using the orientation corrected plane tree
measurements. Flow characteristics of the blood
are obtained from transit time measurements of the
leading edge of the iodine bolus passing through
the artery bed. The reconstructed image can be
displayed on a high resolution graphics monitor.

3.  Analysis  of  the  Computation- 
Intensive  Routines 

The convolution and dynamic search algorithms are
invoked multiple times during the course of the
complete 3D reconstruction. The convolution
operation is invoked each time edge or centerline
detection has to be performed. The edges and vessel
centerline in 2D have to be calculated before any
3D reconstruction can be done. The dynamic search
algorithm is called as frequently as the
convolution software. For a two-view
reconstruction of the coronary artery structure the
number of branches to be calculated will be about
1200 per second of x-ray image data, assuming that
a total of 30 x-ray images are taken per second.
For each such branch to be represented
geometrically, the convolution and dynamic search
have to be applied once for each edge and once for
the centerline. There are instances where the
operator may decide to recalculate the vessel
geometry because it needs further refining. These
algorithms could be called a total of 3600 times for
a total global reconstruction of the plane tree for
one heart cycle. It is quite important that the
above-mentioned software be sped up to provide an
overall improvement in performance. The
complexity of the above algorithms is
examined for this purpose.
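
The call counts quoted above can be reproduced with simple arithmetic. The following is a minimal sketch, assuming the figure of about 20 segments per view given in Section 2 and roughly one second of image data per heart cycle; the one-second cycle and all variable names are assumptions made here for illustration, not values stated by the authors.

```python
# Workload estimate for a two-view reconstruction (illustrative arithmetic only).
SEGMENTS_PER_VIEW = 20   # "about 20 segments" per view (Section 2)
VIEWS = 2                # two-view reconstruction
FRAMES_PER_SECOND = 30   # x-ray images acquired per second
PASSES_PER_BRANCH = 3    # one pass for each of the two edges, one for the centerline

branches_per_second = SEGMENTS_PER_VIEW * VIEWS * FRAMES_PER_SECOND
calls_per_second = branches_per_second * PASSES_PER_BRANCH

print(branches_per_second)  # 1200
print(calls_per_second)     # 3600
```

Under these assumptions, each of convolution and dynamic search is invoked about 3600 times per second of image data, which matches the figure quoted for one heart cycle.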

Time complexity of sequential convolution

T1(t) = O(n*m*k)

where n is the number of rows and m is the number
of columns of the extracted matrix, and k is the
number of elements in the filter (kernel).

Time complexity of sequential dynamic search

T2(t) = O(n*m*s)

where s is the step size for the search window for
dynamic search.
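
As an illustration of the n*m*k operation count, the sketch below applies a 1D filter along every row of a matrix and counts the multiply-add steps. It is not the authors' code: the function name, the zero padding at the row ends and the unflipped (correlation-style) application of the kernel are assumptions chosen to keep the example short.

```python
def convolve_rows(matrix, kernel):
    """Apply a 1D kernel along each row of `matrix` (zero padding at the row ends).

    Returns the filtered matrix and the number of multiply-add steps performed.
    """
    k = len(kernel)
    half = k // 2
    ops = 0
    result = []
    for row in matrix:
        m = len(row)
        out_row = []
        for j in range(m):
            acc = 0.0
            for t in range(k):
                src = j + t - half                      # column read by kernel tap t
                value = row[src] if 0 <= src < m else 0.0
                acc += kernel[t] * value
                ops += 1
            out_row.append(acc)
        result.append(out_row)
    return result, ops


# A 4 x 6 matrix whose rows contain a step edge, filtered with a 3-element gradient kernel.
step_edge = [[0, 0, 0, 1, 1, 1] for _ in range(4)]
filtered, ops = convolve_rows(step_edge, [-1, 0, 1])
print(ops)          # 72 = 4 rows * 6 columns * 3 kernel elements
print(filtered[0])  # [0.0, 0.0, 1.0, 1.0, 0.0, -1.0]
```

The response peaks where the step occurs; the negative value in the last column is an artifact of the zero padding assumed here.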


4. Implementation on the
Multiprocessor Workstation

The Transputer based system consists of multiple
processing nodes arranged in a hypercube network.
The processor in this case is the INMOS T-800
Transputer [8,9]. These processors, available on a
card as a group of four, can be easily installed onto
the IBM-AT bus. One very important reason for
selecting the Transputer is that it provides one of
the best cost vs. performance characteristics of any
distributed memory parallel processing
system. It can also be easily added onto a desktop
personal workstation. Two methods of parallelism
were implemented in this feasibility study.

• The first is the fine grain method, where data
partitioning is done for one branch of the
extracted data to be processed at a time.

• The second is the coarse grain method, where
there is segment parallelism, with one segment
being processed per processing node.

4.1.  Fine  Grain  Method 
The first and most important requirement for
parallel processing is to map the algorithms for the
problem to be solved onto each processor. The
efficiency and speed up of the solution are enhanced
sufficiently by selecting a good strategy for
mapping this problem.

4.1.1. Mapping. The convolution
algorithm convolves a matched filter with
elements of the extracted matrix. This operation
performs the convolution row by row such that
each element of the matched filter is multiplied
with each element of each row of the extracted
matrix. This algorithm can be parallelized to run
on nonoverlapped subsets of data selected from the
complete extracted matrix and loaded onto the
various processors, with no communication during
the course of the convolution.

The only communication among processors is for
the distribution and subsequent collection of the
convolved sub-matrices. The partitioning of this
data is done in the form of vertical strips as shown
in Figure 2. The columns are divided
equally among all the processors, so that each has an
almost equal number of elements of the matrix.
Each processor then reads its partition of the
complete matrix into its local memory to start
the convolution. At the end of the convolution the
resultant matrices are available for parallel dynamic
search. With this scheme of partitioning of the
data, there is an almost linear speed up in these
algorithms. The overhead of mapping is the matrix
distribution before convolution and the recombination
of the sub-matrices after the dynamic search. This
represents a fine grain approach to parallelism as
applied to this particular problem.

Each processor will have a partition of size n rows
by (m/p) columns, where p is the number of
processors in the workstation.

Time complexity for the mapping of the data

T3(t) = O(p)
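
The vertical strip mapping can be sketched as follows. This is a minimal illustration, assuming 0-based column indices and that any remainder columns are handed to the lowest-numbered processors; the function names `strip_bounds` and `split_matrix` are hypothetical and do not come from the original software.

```python
def strip_bounds(m, p):
    """Split m columns into p contiguous vertical strips of near-equal width.

    Returns a list of (start, stop) column ranges, with stop exclusive.
    """
    base, extra = divmod(m, p)
    bounds = []
    start = 0
    for rank in range(p):
        width = base + (1 if rank < extra else 0)  # first `extra` strips get one more column
        bounds.append((start, start + width))
        start += width
    return bounds


def split_matrix(matrix, p):
    """Return one sub-matrix per processor: all n rows, a contiguous slice of the columns."""
    m = len(matrix[0])
    return [[row[a:b] for row in matrix] for a, b in strip_bounds(m, p)]


print(strip_bounds(256, 4))  # [(0, 64), (64, 128), (128, 192), (192, 256)]
print(strip_bounds(10, 4))   # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

The loop over the p processors is the only work that depends on p, which is consistent with the O(p) mapping cost stated above.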

Mapping  for  dynamic  search  can  only  be  along 
vertical  strips.  In  this  case,  the  data  partitions  have 
to  be  a  division  of  the  number  of  columns  only, 
because  the  search  proceeds  vertically  and  proper 
load  balancing  can  only  occur  if  each  processor  is 
kept  busy  with  almost  the  same  work  load.  The 
search  starts  at  the  last  row  and  builds  up  one  row 
at  a  time,  till  it  reaches  the  first  row.  If  the 
division  had  been  along  horizontal  strips,  then  it 
would  result  in  only  one  processor  executing  the 
search  operation  at  one  time  and  then  passing  its 
boundary  path  direction  to  its  neighbor  processor. 
Figure 2: Vertical Strip Mapping for m by n
columns on p processors.
4.1.2. Parallel Convolution. The
extraction of image information into the sub-image
is application dependent but shares the common
feature that pixels are extracted orthogonal to an
idealized search target. At the present time the
initial target is a fixed radius circle and the
sub-image information is extracted orthogonal to this
circle. The operator specifies a search target along
which the extraction occurs. For coronary artery
tracking the target is specified as a set of node
points that are connected by segments. The
individual sub-image matrices have to be convolved
so that the image features are enhanced. The
above-mentioned convolution operation produces a
Structure Likelihood Matrix (SLM). The
importance of the algorithm is in the structure
enhancement of subtracted images using a 1D
gradient density matched filter. The filter elements
are application specific and are selected on the basis
of the densitometric profile of the structure being
tracked. One pass of the convolution algorithm
enhances the features corresponding to one edge.
The filter or kernel is reversed left to right for
detection of the other edge. Finally a center finding
filter is used for finding the vessel centerline.

The  convolution  algorithm  can  operate  on  a  subset 
of  the  extracted  sub-image  data.  The  matrix  is 
partitioned  in  equal  parts  so  each  processor  can 
work  on  its  partition  independently.  This  parallel 
processing  technique  provides  an  overall  speed  up 
of  this  operation.  In  fact,  the  speed-up  is  quite 
linear  and  corresponds  to  the  ideal  maximum  limit, 
over  p  processors  where  p>  7,  as  detailed  in  the 
results. 

Time complexity of parallel convolution

T4(t) = O(n*(m/p)*k)

where (m/p) columns of the matrix will be
convolved simultaneously.

4.1.3. Parallel Dynamic Search. The
Structure Likelihood Matrix (SLM) represents the
enhanced image features for the extracted matrix.
The higher the magnitude of an element of the
SLM, the greater the probability that the
corresponding feature is present in the image. The
dynamic search algorithm finds
the path through the SLM for the edge of a
segment and builds on a greater number of such
segments for a global solution. The path direction
of each element in the bottom row is selected from
a window of size equal to twice the path width.
After the direction for each element in a row is
selected, the magnitude of the element is added to
the magnitude of the element it points to, giving an
updated set of row elements. The path direction is
repeatedly calculated with these updated elements.
The path is traced in this way for all the
elements of each row, for all the rows. At the end of
this operation the top row will have the cumulative
magnitude for the path traced for each element in
that row, having started at the bottom row. The
element with the largest magnitude in the top row
will represent the most likely starting point of the
path. The path traversed by this most likely
element is traced by following the direction vectors
through all the rows for an edge of the artery or
organ.
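
The search described above is a dynamic programming pass followed by a trace along the stored direction vectors. The sketch below is a sequential, single-processor illustration only. It assumes that each element may link to columns within w positions of its own column in the row below, that the best link is the one with the largest cumulative magnitude, that the window is clipped at the matrix edges, and that ties go to the leftmost candidate; none of these details is confirmed by the text, and the function name is hypothetical.

```python
def dynamic_search(slm, w):
    """Trace the most likely path through a structure likelihood matrix.

    slm : list of n rows, each a list of m likelihood values
    w   : path width; an element may link to columns c-w .. c+w in the row below
    Returns the path as one column index per row, from the top row to the bottom row.
    """
    n, m = len(slm), len(slm[0])
    cum = [row[:] for row in slm]          # cumulative magnitudes
    step = [[0] * m for _ in range(n)]     # direction vector chosen for each element

    # Build up from the last row towards the first row, one row at a time.
    for r in range(n - 2, -1, -1):
        for c in range(m):
            lo, hi = max(0, c - w), min(m - 1, c + w)
            best = max(range(lo, hi + 1), key=lambda j: cum[r + 1][j])
            step[r][c] = best - c
            cum[r][c] = slm[r][c] + cum[r + 1][best]

    # The largest cumulative magnitude in the top row marks the start of the path.
    col = max(range(m), key=lambda j: cum[0][j])
    path = [col]
    for r in range(n - 1):
        col += step[r][col]
        path.append(col)
    return path


slm = [
    [0, 1, 9, 1, 0],
    [0, 1, 1, 9, 0],
    [0, 1, 1, 1, 9],
    [0, 1, 1, 9, 0],
]
print(dynamic_search(slm, 1))  # [2, 3, 4, 3]
```

Each element examines a window of at most 2w + 1 candidates, so the work grows as the number of rows times the number of columns times the window size, in line with the n*m*s estimate given for the sequential search.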

The vertical strips are mapped onto the processors
so that the computation envelope is an array of
processors, one per partition, as shown in Figure 2.
The step size or the search window provides the
minimum and maximum limit for the direction
vector of the path for an element. The elements at
the boundary columns of the partitioned
sub-matrices on a particular processor need to
communicate with the neighboring processor that
overlaps their search window. For example,
assume that the element at column (m/p) and row
(n-1), as shown in Figure 2, has a step size of 2
elements, such that it has a path extending 2
elements in the left and right directions
respectively. As this element falls on a partition
boundary, it will need to access two elements in
the left adjacent columns, which are in the partition
loaded on processor 0. Likewise the right boundary
elements on processor 0 will have to
access elements of their search window extending into
the left hand columns of processor 1. Thus at each
iteration there is the overhead of communication
among processors. The total number of elements
that have to be communicated to a neighbor at the
end of each iteration of dynamic search is:

Nc = [n*(p-2)*w]

where w is the path width of the search window.
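
To make the boundary exchange concrete, the sketch below lists the columns that a strip must read from its neighbors for a given path width. It is an illustration under the assumptions of 0-based column indices and contiguous strips; the function name `halo_columns` is hypothetical, and the sketch does not attempt to reproduce the expression for Nc.

```python
def halo_columns(start, stop, m, w):
    """Columns outside the strip [start, stop) that its boundary elements must read.

    m is the total number of columns and w the path width of the search window.
    Returns (left, right): column indices owned by the left and right neighbors.
    """
    left = list(range(max(0, start - w), start))
    right = list(range(stop, min(m, stop + w)))
    return left, right


# 256 columns on 4 processors with a path width of 2 elements.
print(halo_columns(0, 64, 256, 2))    # ([], [64, 65])
print(halo_columns(64, 128, 256, 2))  # ([62, 63], [128, 129])
```

The first and last strips have a neighbor on one side only, while interior strips exchange w columns with each of their two neighbors.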

4.2.  Coarse  Grain  Method 
An alternative method for parallel implementation is
to adopt a coarse grain approach. The edge
detection involving convolution and dynamic search
operations is performed on multiple branches
simultaneously. With this approach, p branches
can be processed at the same time on the p
processors, with one branch per processor. The
branch image data corresponding to each segment is
loaded on each processor, such that there is
absolutely no interaction between processors during
the convolution and dynamic search operations. At
the end of the above operations the information
relating to the two edges and the centerline of the
vessel is stored in the data structure of the plane
tree, to be used in the 3D reconstruction operation.
There are about 20 branches per view and so an
array of about 40 processors can be utilized
simultaneously for all the arterial tree data from
two views.

The computation load on each processor is now a
function of the size of the branch matrices. The
processors perform the edge detection algorithms
in an asynchronous manner. One processor may
therefore complete its detection before another.
Such a processor does not have to wait for the
computation on all other processors to be completed
if there is other work, such as 3D geometry
computation, to be performed; otherwise it has to
let the other processors catch
up. Thus, the best performance can be expected
when the number of branches is equal
to the number of processors, and each branch
matrix is of the same size.
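
The effect of unequal branch sizes on the coarse grain method can be estimated with a simple model. The sketch below assumes one branch per processor and work proportional to the number of elements in the branch matrix; both the model and the function name are assumptions made for illustration. The 122 by 60 shape is the largest branch reported in Section 5.2, while the smaller shapes are invented for the example.

```python
def coarse_grain_speedup(branch_shapes):
    """Estimate speed up when each branch matrix is given to its own processor.

    branch_shapes : list of (rows, columns), one entry per branch and processor.
    Work per branch is taken to be proportional to rows * columns.
    """
    work = [rows * cols for rows, cols in branch_shapes]
    sequential = sum(work)   # one processor handles every branch in turn
    parallel = max(work)     # every processor waits for the largest branch
    return sequential / parallel


equal = [(122, 60)] * 4
unequal = [(122, 60), (60, 60), (60, 30), (30, 30)]
print(coarse_grain_speedup(equal))               # 4.0
print(round(coarse_grain_speedup(unequal), 2))   # 1.86
```

With equal branch sizes the estimate reaches the number of processors; with unequal sizes the largest branch dominates and the smaller processors sit idle unless other work is available.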


5.  Evaluation  of  Results 

5.1.  Fine  Grain  Method 
The  parallel  convolution  algorithm  provides 
performance  directly  proportional  to  the  number  of 
the  processors.  There  is  a  linear  correlation 
between  the  number  of  processors  and  the  relative 
speed  up,  as  shown  in  Figure  3  and  Figure  4. 
Relative speed up S can be defined as:

S = (Time taken on 1 processor) / (Time taken on p processors)

Figure 3 shows the results of the fine grain
method. For parallel convolution in this method,
there is no interaction among processors after the
sub-matrices are loaded on the processors, and these
matrices have no overlapped elements. Moreover,
the convolved matrices can be recombined quite
inexpensively. Thus perfect speed up was obtained
for this algorithm in the benchmark studies.

The parallel dynamic search has the overhead of
communicating the elements corresponding to the
path width at the boundaries of the sub-matrices.
Typical path width w is 1 or 2 elements. The
benchmark has been implemented for a path width
of 2 elements. There is some overlap
between communication and computation in
this case: while a processor is
communicating with its neighbors, it starts the
subsequent computation after some communication
set up time. The performance of parallel dynamic
search decays as this communication increases.
The processors are used as a linear array, so there is
bi-directional communication of elements.

The fine grain method consists of the combination
of parallel convolution and parallel dynamic search
as applied to only one branch at a time. There is
initial overhead in performing the mapping of the
various partitions of data onto the processors. Because
an efficient method of partitioning was selected
for this method, this overhead is minimal in
comparison to the overall time for solution. This
method provides very good load balancing of the
array of processors. The overall performance with
this method is almost linear for up to 4 processors.

There is a slight decay in the speed-up characteristic
curve at p = 4 because of the overhead in
communication in dynamic search, as shown in
Figure 3. It is apparent that the performance is
quite impressive and closely follows the theoretical
limit. The analysis was performed for a test case of
simulated edge data. The matrix was a 256 by 256
representation of a sinusoid-like edge. The
benchmark was calculated for 1, 2 and 4 processors
respectively. It took 14, 7 and 4 seconds for the 1,
2 and 4 processor cases respectively. These
timings do not include the time to read the matrix
data from the hard disk drive, which is 2 seconds.
The results can be extrapolated to more processors
for almost linear speed up. This method works
well for large data sets, but performance
diminishes as the partition size decreases. There is
communication involved as each row of the matrix
is searched. As more processors are added, the
partition will become small enough that the time
for computation is less than the time for
communication, and so the solution will be
communication driven. At this point or close to
this point, the performance will decrease and no longer be
directly proportional to the number of processors.
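
The relative speed up and the parallel efficiency (speed up divided by the number of processors) follow directly from the timings quoted above. The sketch below is plain arithmetic on those reported values; the function name is hypothetical, and the 2 second disk read is left out, as it is in the quoted timings.

```python
def relative_speedup(timings):
    """Map processor count -> (speed up, efficiency) from run times in seconds."""
    t1 = timings[1]
    return {p: (t1 / t, t1 / t / p) for p, t in sorted(timings.items())}


measured = {1: 14.0, 2: 7.0, 4: 4.0}   # seconds, disk read time excluded
for p, (speedup, efficiency) in relative_speedup(measured).items():
    print(p, speedup, efficiency)
# prints:
# 1 1.0 1.0
# 2 2.0 1.0
# 4 3.5 0.875
```

The drop from 100% to 87.5% efficiency at 4 processors corresponds to the slight decay in the speed-up curve attributed to communication in the dynamic search.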

5.2.  Coarse  Grain  Method 
This method applies convolution and dynamic
search to a complete extracted matrix corresponding
to a branch of the arterial tree. This method
provides performance which is linear for a total of p
branches applied to p processors, as shown in
Figure 4. The x-ray data is segmented into
unrelated subsets. The extraction of these
sub-matrices and the subsequent detection of edges
for one branch is completely disjoint from the next,
and so can be performed independently on different
processors of the system. There could potentially
be about 800 such segments to be processed over a
complete coronary cycle. It is apparent that an 800
processor system can run on 800 segments to
generate the geometry for a plane tree and then
proceed with this coarse grain approach for the 3D
reconstructed set of branches.
The x-ray images were 256 rows by 256 columns
of data. The arterial tree was segmented into
sub-matrices for separate branches. The size of the
largest extracted data matrix for a branch in this test
was 122 rows by 60 columns, and the rest were
smaller in size. The benchmark analysis was
performed for a maximum of 4 processors and so 4
branches were selected from the segmented set of
branches of the tree. The time taken for detecting
both edges of this segment was 3 seconds. Thus
the time for obtaining the edges for all 4 segments
was 3 seconds. In comparison a single processor
based computer system would have taken about 12
seconds to perform the same computation. As
another case for observation, consider the
computation for edge detection of 32 segments: the
time taken by the coarse grain method will still be
about 3 seconds, while the time taken on a single
processor computer with the same processing power
will be about 96 seconds. There is a clear
advantage in cost vs. performance in this approach,
as applied to 3D coronary reconstruction.
Figure 3: Fine Grain Method
Figure 4: Coarse Grain Method
Abstract

Recently multigrid techniques have been proposed for
solving low-level vision problems in optimal time (i.e.
time proportional to the number of pixels). In the
present work this method is extended to incorporate
a discontinuity detection process cooperating with the
smoothing phase on all scales. Activation of line
element detectors that signal the presence of relevant
discontinuities is based on information gathered from
neighboring points at the same and different scales.

Because the required computation is local,
parallelism can be profitably used. A mapping of the
required data structure onto a two dimensional mesh
of processors is suggested. Domain decomposition is
shown to be efficient on MIMD computers capable of
containing many individual cells in each processor.

Some examples of the proposed multiscale solution
techniques are shown for two different applications. In
the first case a surface is reconstructed from first
derivative information (extracted from the intensity data), in
the second case from noisy depth constraints.

1  Introduction 

In recent years a sound scientific basis has been given
to low and intermediate level vision that decodes
information about three-dimensional surfaces and their
properties.

Subsequent visual processing can be facilitated if the
different constraints are transformed into a visible
surface representation that unambiguously specifies
surface shape at every image point.

It is practically very hard to recognize an object in a
visual scene unless one knows how to choose the subset
of evidence that derives from the same object. Hence
discontinuities are necessary both to avoid washing
away important information under the smoothness
requirement, and to provide a primitive perceptual
organization of the visual input into different elements
loosely related to the human notion of objects. In
some schemes the smoothing and discontinuity
detection steps are done at different times, but there is a
general suggestion that both should be done at the same
time, since the first hides evidence used by the
second [11].

Psychophysics and practical implementations (see for
example [1,6,10]) show that the early vision step can be
done in parallel. Many computational units (neurons or
processors) cooperate to reach the desired solution with
a speed-up roughly proportional to their number.

In the following, first the multigrid method with
discontinuities is briefly described, then the
parallelization strategy is outlined, and finally some results are
presented.

2 Multigrid Method with Discontinuities

Early vision can be considered “inverse optics”, since
its purpose is to undo the image formation process,
recovering the properties of visible 3-D surfaces from the
2-D array of image intensities. In general the class of
admissible solutions is restricted by introducing a priori
knowledge. In the regularization method the desired
or plausible properties are enforced when the inversion
problem is transformed into the minimization of a
functional [12].

The stationary points of the functional are found
by solving the Euler-Lagrange partial differential
equations.

In standard methods for solving PDEs, the problem
is first discretized on a finite dimensional approximation
space. The very large algebraic system obtained is then
solved using, for example, “relaxation” algorithms, which
are local¹ and iterative.

By the local nature of the relaxation process, solution
errors on the scale of the solution grid step are corrected
in a few iterations; larger-scale errors, on the contrary,
are corrected very slowly. Intuitively, in order to correct
them, information must be spread over a large scale by
the “sluggish” neighbor-neighbor influence. A larger
spread of influence per iteration demands large-scale
connections for the processing units, i.e. a solution of
the same problem on a coarser grid.
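
The different behavior of small-scale and large-scale errors under relaxation can be observed on a one-dimensional model problem. The sketch below is an illustration only: it assumes weighted Jacobi relaxation with weight 2/3 applied to the error of the equation u'' = 0 on a grid with 64 intervals, whereas the text speaks of relaxation in general and of two-dimensional vision problems. The function names are hypothetical.

```python
import math


def jacobi_sweeps(err, sweeps, omega=2.0 / 3.0):
    """Apply weighted Jacobi sweeps to an error vector for the 1D model problem u'' = 0.

    err holds the error at grid points 0..N; the two end values are left unchanged.
    """
    for _ in range(sweeps):
        new = err[:]
        for i in range(1, len(err) - 1):
            neighbor_avg = 0.5 * (err[i - 1] + err[i + 1])
            new[i] = (1.0 - omega) * err[i] + omega * neighbor_avg
        err = new
    return err


def remaining_error(mode, N=64, sweeps=10):
    """Largest error magnitude left after relaxation, starting from a unit sine mode."""
    err = [math.sin(mode * math.pi * i / N) for i in range(N + 1)]
    return max(abs(e) for e in jacobi_sweeps(err, sweeps))


print(remaining_error(1))    # smooth, large-scale error: barely reduced
print(remaining_error(48))   # oscillatory, grid-scale error: almost eliminated
```

After ten sweeps the smooth mode retains more than 99% of its amplitude while the oscillatory mode is reduced by many orders of magnitude, which is the behavior that motivates moving the smooth part of the error to a coarser grid.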

¹ The local structure is essential for efficient use of parallel
computation.
The pyramidal structure of the multigrid solution
grids is illustrated in Figure 1.

Figure 1: Pyramidal structure for multigrid algorithms.

This simple idea and its realization in the multigrid
algorithm not only leads to asymptotically optimal
solution times (i.e. convergence in O(n) operations) but
also dramatically decreases solution times for a variety
of practical problems, as shown in [3].

In the multigrid strategy, relaxation is first used to
obtain an approximation with smooth error on a fine
grid. Then, given the smoothness of the error,
corrections to this approximation are calculated on a coarser
grid; to do this, relaxations are first
executed, then corrections are calculated recursively on
still coarser grids. The nested iteration scheme (use
of coarser grids to provide a good starting point for
finer grids) is used to speed up the initial part of the
computation. Historically these ideas were developed
starting from the sixties by Bakhvalov, Fedorenko and
others (see Stüben et al. [13]).

It is shown in [3] that, with a few modifications in
the basic algorithms, the actual solution (not the error)
can be stored in each layer. This method is particularly
useful for visual reconstruction, where we are interested
not only in the finest scale result but also in the
multiscale representation developed as a byproduct of the
solution process. This is called the full approximation
storage algorithm and it is briefly described in what follows.

The algebraic system, obtained by discretizing the
original problem on the different grids (numbered by $k$
with $0 \le k \le \ell$, 0 = coarsest) is

$$A^{k} x^{k} = d^{k} \qquad (1)$$

The data on the finest grid define $d^{\ell}$, while for the
coarser grids the right hand side $d^{k}$ is obtained using
the two extension (fine to coarse) and interpolation
(coarse to fine) operators, respectively $I_{\uparrow}$ and
$I_{\downarrow}$, in this way:

$$d^{k} = A^{k} \left( I_{\uparrow} x^{k+1} \right) + I_{\uparrow} \left( d^{k+1} - A^{k+1} x^{k+1} \right) \qquad (2)$$

Before computation is begun on a grid finer than the
current one, the initial values for $x$ are updated as:

$$x^{k} \leftarrow x^{k} + I_{\downarrow} \left( x^{k-1} - I_{\uparrow} x^{k} \right) \qquad (3)$$

Instead, before computation is begun on a grid
coarser than the current one, the initial values for $x$
are updated as:

$$x^{k} \leftarrow I_{\uparrow} x^{k+1} \qquad (4)$$

The switching of control between different grids is
explained in Figure 2.

The sequential multigrid algorithm was used for solving
PDEs associated with different early vision problems
in [14], obtaining typical speed-up factors of 100.

2.1  Line  Processes 

Even  if  there  are  some  results  in  the  literature  (see  [15], 
[11,7]),  up  to  now  it  is  not  clear  how  to  combine  in  an 
optimal  way  the  surface  reconstruction  and  the  discon¬ 
tinuity  detection  processes. 

Various approaches are based on different degrees of
cooperativity of the two processes, considering both time
and scale (see [2] for details).

In some cases the discontinuity detection step is assigned
to a separate preliminary process. Assuming
this, in a regularization approach the smoothness constraint
is no longer enforced globally, but locally,
depending on the presence or absence of line processes.

In other schemes, discontinuities are detected after
the smoothing step, for example by taking derivatives²
and thresholding them appropriately.

¹This definition agrees with the idea that coarse-scale corrections
are a top-down influence. The definition given in “mathematical”
texts is usually the opposite, so beware.

²Error in derivatives will be smaller after regularization.

Figure 2: Flow of control in sequential multigrid (adapted from
Brandt).

Finally, other proposals consider cooperation of the
two processes in time but do not consider the problem
of organizing the cooperation in scale.

In [9], for example, a new term is added to the energy
function to favor a good discontinuity structure.
In their hardware implementation, an analog network
minimizes the “smoothness and data agreement energy”
while, in a cyclic way, a digital network updates
the line processes minimizing the “discontinuity energy”.

Summarizing, in the first two approaches one process
cannot make use of information exchange with the
“dual” one, while in the last one computation tends to be
very slow for large images.

We propose to combine discontinuity detection and
surface reconstruction in time and scale. To do this,
we introduce line processes at different scales, interacting
with neighboring depth processes (henceforth DPs)
at the same scale and with neighboring line processes
(henceforth LPs) on the finer and coarser scale. The
reconstruction assigns equal priority to the two process
types.

The recursive multiscale call mg(layer) is based on
an alternation of relaxation steps and discontinuity
detection steps as follows (in C language):

    void mg(int layer)
    {
        int i;

        if (layer == coarsest) step(layer);
        else {
            i = na; while (i--) step(layer);      /* steps before going coarser */
            i = nb; if (i != 0)
                { up(layer); while (i--) mg(layer-1); down(layer-1); }
            i = nc; while (i--) step(layer);      /* steps after coming back */
        }
    }

    void step(int layer)
    {
        exchange_border_strip(layer);
        update_line_processes(layer);
        relax_depth_processes(layer);
    }
Each  step  is  preceded  by  an  exchange  of  data  on  the 
border  of  the  assigned  domains,  as  explained  in  section 
3  dedicated  to  the  parallel  implementation. 

As  we  will  show  in  the  following  this  scheme  not  only 
greatly  improves  convergence  speed  ( the  typical  multi¬ 
grid  effect)  but  also  produces  a  more  consistent  recon¬ 
struction  of  the  surface  at  different  scales. 
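
To make the recursion concrete, the following is a minimal sketch in C that only counts how often each layer is visited during one multiscale sweep. The routines step(), up() and down() are stand-ins that do no numerical work, and N_LAYERS, step_calls and run_sweep() are hypothetical names introduced for this illustration; na = nb = nc = 1 corresponds to a V cycle with one step before and one step after each visit to the coarser layer.

```c
#define N_LAYERS 4                /* layers 0..3, layer 0 is the coarsest */

const int coarsest = 0;
int na = 1, nb = 1, nc = 1;       /* V cycle, one step per visit */
int step_calls[N_LAYERS];         /* number of step() calls per layer */

/* Stand-ins: a step only counts itself, the grid transfers do nothing. */
void step(int layer) { step_calls[layer]++; }
void up(int layer)   { (void)layer; }
void down(int layer) { (void)layer; }

/* Same control flow as the recursive multiscale call in the text. */
void mg(int layer)
{
    int i;

    if (layer == coarsest) { step(layer); return; }

    i = na; while (i--) step(layer);
    i = nb;
    if (i != 0) {
        up(layer);
        while (i--) mg(layer - 1);
        down(layer - 1);
    }
    i = nc; while (i--) step(layer);
}

/* Reset the counters and run one sweep starting from the finest layer. */
void run_sweep(void)
{
    int k;
    for (k = 0; k < N_LAYERS; k++) step_calls[k] = 0;
    mg(N_LAYERS - 1);
}
```

With these settings every layer except the coarsest is stepped twice and the coarsest once. If each coarser layer holds a quarter of the points of the layer below it (an assumption about the pyramid), one sweep costs about 2 + 2/4 + 2/16 + 1/64 ≈ 2.6 finest-grid steps.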

2.2  Mutual  Interaction 


During the course of the reconstruction, each LP updates
its value in a manner depending on the values of
other connected LPs in a neighborhood. It is useful to
define three different subsets of this neighborhood: the
set of connected LPs at the same scale SSN, its subset
SSN* lacking the two parallel LPs (defined as the LPs
at both sides of the given one and with the same orientation,
see Figure 3) and the set DSN containing the
connected LPs at the coarser and finer scales.

Figure 3: LP neighborhood inside a given layer.

Considering  first  the  SSN,  a  LP  is  connected  to  other 
LPs  at  the  same  resolution  with  the  pattern  shown  in 
Figure  3.  The  influence  of  the  parallel  LPs  (suggested 
in  [9])  becomes  essential  in  the  multi-scale  scheme,  to 
avoid  duplication  of  lines  caused  by  the  coarse  to  fine 
influence. 

Considering  the  choice  of  the  connections  between  dif¬ 
ferent  layers  for  the  grid  geometry  used  (see  Figure  4), 
it’s  apparent  that  there  is  no  immediate  definition  of 
the  LPs  above  or  below  a  given  one. 

One possible solution is based on this prescription:
if neuron x in scale X is connected to y in scale Y,
then conversely y will be connected to x (symmetry in
scale). In this case the fine to coarse influence is derived
uniquely after defining the coarse to fine one.

Given this, LPs in the coarser scale are connected if
they have minimum distance (in the x-y plane) to
the given LP. With this definition some LPs will have
two  LPs  above  with  minimum  distance,  while  others 
will  have  one.  This  asymmetry  can  be  corrected  by 
adjusting  the  connection  weights  so  that  the  combined 
influence  of  the  two  minimum  distance  LPs  (defined  as 
weak  influence)  will  be  the  same  as  the  influence  of  the 
single  LP  in  the  other  case  (defined  as  strong  influence), 
as  will  be  shown  in  the  following  section. 

2.3  Updating  Rule  and  Look-up  Table 

Starting from partial “visual” information, the dynamical
system of the line and depth processes on the different
scales must evolve in time to a state corresponding
to a faithful reconstruction of the three-dimensional
structure and a perceptual grouping of it into “meaningful”
pieces. Therefore activation of LPs must be
favored either by the presence of a “large” difference in
the z values of the nearby DPs or by the presence of
a partial discontinuity structure that can be improved.
Because the perceptual grouping usually corresponds
to the underlying physical structure, these two driving
forces cooperate to create the desired results.

Let us define as benefit the square of the derivative at
a given point (activation of the LP is “beneficial” when
this quantity is large):

$$\text{Benefit} = (\partial z/\partial x)^2 \approx (\Delta z/\Delta x)^2$$

for a vertical LP, and let us introduce a cost for a line
process in a given environment:

$$\text{Cost} = f(\text{LPs} \in \text{SSN};\ \text{LPs} \in \text{DSN})$$

The updating rule for an LP is given by

$$\text{LP} \leftarrow 1 \quad \text{iff} \quad \text{Cost} < \text{Benefit}$$

Because the Cost is a positive quantity, LPs will be
switched on only when there is a difference in nearby z
values. Moreover, since the Cost depends on the LP's
neighborhood, a good discontinuity structure can be
favored by “discounting” Cost if the local structure is
improved by the given LP.

Cost is a function of a limited number of binary variables;
therefore, to increase simulation speed and to provide
a convenient way of simulating different heuristic
proposals, a look-up table approach was used.

As shown in Figure 5, values of nearby LPs are used
to form an index into the table containing the Cost
values³.
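
The table look-up and the updating rule can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the neighborhood is taken to contain 8 binary LPs (which gives the 256-entry table), the bit order is arbitrary, and the names lp_index, lp_benefit and lp_update are hypothetical.

```c
#define N_NEIGHBORS 8                    /* assumed size of the LP neighborhood */
#define TABLE_SIZE (1 << N_NEIGHBORS)    /* 2^8 = 256 entries */

/* Pack the binary states of the neighboring LPs into a table index:
   neighbor k contributes bit k. */
unsigned lp_index(const int neighbors[N_NEIGHBORS])
{
    unsigned index = 0;
    int k;
    for (k = 0; k < N_NEIGHBORS; k++)
        if (neighbors[k]) index |= 1u << k;
    return index;
}

/* Benefit: squared finite-difference slope across the LP. */
double lp_benefit(double z_left, double z_right, double h)
{
    double slope = (z_right - z_left) / h;
    return slope * slope;
}

/* Updating rule: the LP is switched on iff Cost < Benefit.
   Returns the new LP value (1 = active, 0 = inactive). */
int lp_update(const double cost_table[TABLE_SIZE],
              const int neighbors[N_NEIGHBORS],
              double z_left, double z_right, double h)
{
    return cost_table[lp_index(neighbors)] < lp_benefit(z_left, z_right, h);
}
```

Because the table is filled once, different heuristics for the Cost can be tried by regenerating the table without touching the update loop.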

2.4  Invariance,  Scale  and  Topology 

Segmentation should not depend on the physical scale
of the structure. If the depth values of a surface are
multiplied by a constant, the same distribution of line
processes must be obtained by scaling the Costs appropriately.
Besides, the “topological” influence (enforcement
of good discontinuity structure) should be
independent of scale.

³For 8 neighbors one gets a 256 entry table.

Figure 5: Table look-up for line processes.

To separate the effects of scale and topology we decided
to isolate the scale factor into one parameter dh,
corresponding to the typical size of $\partial z/\partial x$ and
$\partial z/\partial y$ that we want to be detected by our LPs:

$$\text{Cost}_0 = f(0,\ldots,0;\ 0,\ldots,0) = dh^2$$

It would be of little practical use to allow 256 degrees
of freedom in the definition of Costs for the SSN. Rotational
invariance must hold: if a given configuration
is rotated, Cost must remain equal.

We decided to classify all possible SSN* configurations
(let's neglect the effect of the parallel LPs for the
moment) into groups, depending on the number of regions
into which the surface is divided at the location
of the discontinuity. For some examples, see Figure 6.
The Cost for a neighborhood with n cuts is multiplied
by a parameter $a_n$. If the number of cuts is too large,
Cost is set to a very large value (to penalize formation
of “tangled” lines).

$$\text{Cost}_n = f(\text{LPs} \in \text{SSN}^*, 0, 0;\ 0,\ldots,0) = \text{Cost}_0 \times a_n$$
if the local surface patch is cut into n pieces;
$$\text{Cost} = \infty \quad \text{if } n > 5.$$

Figure 6: Rotational invariance and topological classes.

The “inhibitory” influence of parallel lines is described
by a factor $a_p$, with

$$\text{Cost}(\text{LPs} \in \text{SSN};\ 0,\ldots,0) = \text{Cost}(\text{LPs} \in \text{SSN}^*, 0, 0;\ 0,\ldots,0) \times a_p^{\,n_p}$$

where $n_p$ = number of parallel LPs ∈ SSN.

Last but not least, presence of lines at the coarser or
finer scale will reduce Cost by factors $r_a$ or $r_b$ respectively,
in the strong influence case. In the weak influence
case the factors become $\sqrt{r_a}$ or $\sqrt{r_b}$⁴:

$$\text{Cost}(\text{LPs} \in \text{SSN};\ \text{LPs} \in \text{DSN}) = \text{Cost}(\text{LPs} \in \text{SSN};\ 0,\ldots,0) \times r_a^{\,n_a} \times r_b^{\,n_b}$$

where $n_a$ = number of above LPs ∈ DSN, and likewise $n_b$ for the LPs below
(× 1/2 if weak influence).

The parameters $a_n$ were chosen by trial-and-error.
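
The way the factors of this section combine can be sketched in C as below. This is an illustrative sketch, not the authors' table generator: the struct layout, the names CostParams, ipow and lp_cost, and the finite stand-in COST_INFINITE are assumptions. The inter-scale factors are stored as their square roots, so that a weak LP contributes one factor and a strong LP two, which makes the combined influence of two weak LPs equal to that of one strong LP by construction.

```c
#define MAX_CUTS 5            /* more than 5 pieces: Cost is "infinite" */
#define COST_INFINITE 1.0e30  /* finite stand-in for an infinite Cost */

typedef struct {
    double dh;                /* typical slope to be detected */
    double a[MAX_CUTS + 1];   /* a[n]: factor for a patch cut into n pieces */
    double a_p;               /* inhibitory factor per parallel LP */
    double sqrt_r_above;      /* square root of the coarser-scale factor */
    double sqrt_r_below;      /* square root of the finer-scale factor */
} CostParams;

/* x raised to a small non-negative integer power. */
double ipow(double x, int n)
{
    double y = 1.0;
    while (n-- > 0) y *= x;
    return y;
}

/* half_above and half_below measure the inter-scale influence in units
   of one weak LP: a weak LP counts 1, a strong LP counts 2. */
double lp_cost(const CostParams *p, int n_cuts, int n_parallel,
               int half_above, int half_below)
{
    if (n_cuts > MAX_CUTS) return COST_INFINITE;
    return p->dh * p->dh * p->a[n_cuts]
         * ipow(p->a_p, n_parallel)
         * ipow(p->sqrt_r_above, half_above)
         * ipow(p->sqrt_r_below, half_below);
}
```

Evaluating lp_cost for every configuration of the neighborhood would fill the look-up table once, before the reconstruction starts.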

⁴Let's remember that the combined weak influence of two LPs
(equal to $\sqrt{r_a} \times \sqrt{r_a}$) must be equal to the strong influence of a
single LP (equal to $r_a$).

3 Parallel Implementation

The multigrid algorithm described in the previous section
can be executed in different ways on a parallel computer.
One essential distinction that has to be made is
related to the number of processors available and the
“size” of a single processor.

If  implementation  is  done  on  a  SIMD  parallel  com¬ 
puter  with  a  number  of  processors  comparable  to  the 
number  of  computational  units,  the  strategy  that  as¬ 
signs  one  processor  to  each  unit  (see  [4])  obtains  the 
maximum  amount  of  parallelism.  The  drawback  of 
this  approach  is  that  if  the  implementation  is  on  a  hy¬ 
percube  parallel  computer  and  if  the  mapping  is  such 
that  all  the  communications  paths  in  the  pyramid  are 
mapped  into  communication  paths  in  the  hypercube 
with  length  bounded  by  two  [4],  a  fraction  of  the  nodes 
is  never  used  (one  third  for  two-dimensional  problems 
encountered  in  vision).  Furthermore,  if  the  standard 
multigrid  algorithm  is  used,  when  iteration  is  on  a 
coarse  scale  all  the  nodes  in  the  other  scales  (i.e.  the 
majority of nodes) are idle and the efficiency of computation
is in part compromised. To ameliorate this problem,
intrinsically parallel multiscale algorithms must be
considered [5].

Fortunately,  for  a  MIMD  computer  with  power¬ 
ful  processors,  sufficient  distributed  memory  and  two- 
dimensional  internode  connections  (in  particular  the 
hypercube  contains  a  two  dimensional  mesh),  the  above 
problems  do  not  exist. 

In  this  case  a  two-dimensional  domain  decomposition 
can  be  used  efficiently:  a  slice  of  the  image  with  its 
associated pyramidal structure is assigned to each processor.
All nodes are working all the time, switching
between different levels of the pyramid as illustrated in
Figure 7.

No modification to the sequential algorithm is needed
for points in the image belonging to the interior of the
assigned domain. In contrast, points on the border
need to know values of points assigned to a nearby
processor. For this purpose the assigned domain is
extended to contain points assigned to nearby processors,
and a communication step before each iteration on
a given layer is responsible for updating this strip so
that it contains the correct (most recent) values. Only
two exchanges are necessary, as shown in Figure 8.
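
The border-strip idea can be illustrated with a one-dimensional sketch in which two neighboring domains live in the same program. The Domain type, the size NX and the two-argument form of exchange_border_strip are assumptions made for this example; on a real multicomputer each assignment in the exchange would be a send/receive pair between nodes.

```c
#define NX 4   /* owned points per domain (illustrative) */

/* A one-dimensional slice: cells 1..NX are owned by the processor,
   cells 0 and NX+1 hold copies of the neighbors' border values. */
typedef struct { double z[NX + 2]; } Domain;

/* Exchange between a domain and its right-hand neighbor. */
void exchange_border_strip(Domain *left, Domain *right)
{
    left->z[NX + 1] = right->z[1];   /* neighbor's first owned cell */
    right->z[0]     = left->z[NX];   /* neighbor's last owned cell  */
}

/* One Jacobi-style averaging sweep over the owned cells only.  Border
   cells read the copied values exactly as interior cells read their
   neighbors, so the sequential update needs no special cases. */
void relax(Domain *d)
{
    double old[NX + 2];
    int i;
    for (i = 0; i < NX + 2; i++) old[i] = d->z[i];
    for (i = 1; i <= NX; i++) d->z[i] = 0.5 * (old[i - 1] + old[i + 1]);
}
```

In two dimensions the same exchange is performed once per direction, which is why two exchanges per iteration suffice.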

3.1  Communication  Overhead  and 
Complexity 

Multigrid algorithms are optimal in the sense that
they can compute a solution in time proportional to
the number of unknowns. Let's suppose that complexity
for the standard algorithm is (asymptotically)
Time = a n, where n is the number of pixels and a
depends on the details of the algorithm.

Figure 7: Domain decomposition for multigrid computation.
Processor communication is on a two-dimensional grid; each
processor operates at all levels of the pyramid.

To a good approximation the complexity for the parallel
version is:

$$\text{Time} = a\,\frac{n}{D} + p\,\sqrt{\frac{n}{D}} \qquad (5)$$

where  D  is  the  number  of  domains  (equal  to  the  number 
of  processors).  The  communication  overhead  is  a  “sur¬ 
face  effect”  proportional  to  the  linear  dimension  of  the 
domain.  The  proportionality  factor  p  depends  on  the 
number  of  iterations  and  on  the  height  of  the  pyramidal 
structure. 
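
A small C sketch of this cost model follows. It assumes square domains, so that n/D = side × side and the linear dimension of a domain is side, and it assumes the model has the form Time = a n/D + p sqrt(n/D), i.e. a volume term plus a surface term. The function names are hypothetical.

```c
/* Estimated time for one processor owning a square domain of
   side x side pixels: a * (area) + p * (linear dimension). */
double parallel_time(double a, double p, int side)
{
    return a * side * side + p * side;
}

/* Fraction of that time spent in communication (the surface term). */
double comm_fraction(double a, double p, int side)
{
    return (p * side) / parallel_time(a, p, side);
}
```

For example, with the purely illustrative values a = 1 and p = 8, a 32 × 32 domain spends 256 / 1280 = 20% of its time communicating, and the fraction falls as the domain grows, since the surface term grows more slowly than the volume term.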

Preliminary timing has been done using a board
with four processors⁵, obtaining times of 600-900 ms for
65x65 images. Each node spends approximately 20%
of its time in internode communication. In addition
some time is required to load the data and read results.
Results are illustrated graphically in Figure 9.

4  Results:  Shape  from  Shading 

An iterative scheme for solving the shape from shading
problem has been proposed in [8]. A preliminary phase
recovers information about orientation of the planes
tangent to the surface at each point by minimizing a
functional containing the image irradiance equation and
an integrability constraint, as follows:

$$E(p,q) = \int_{\text{Image}} (I(x,y) - R(p,q))^2 + \lambda\,(p_y - q_x)^2 \, dx\, dy \qquad (6)$$

where $p = \partial z/\partial x$, $q = \partial z/\partial y$, I = measured intensity,
and R = theoretical reflectance function.

⁵Definicon board with Transputers, software from Parasoft.

Figure 8: Communication strategy: each node contains a strip
of data assigned to nearby processors. Values are updated before
each iteration using exchanges in the two directions.

After the tangent planes are available, the surface z
is reconstructed by minimizing the following functional:

$$E(z) = \int_{\text{Image}} (z_x - p)^2 + (z_y - q)^2 \, dx\, dy \qquad (7)$$

Figure 10 shows the reconstruction of the shape of
a hemispherical surface starting from a ray-traced image⁶.
Above is the result of standard relaxation after
100 sweeps, below the “minimal multigrid” result⁷,
whose total solution time is equivalent to approximately
four iterations on the finest grid.

⁶A simple Lambertian reflection model is used.

⁷V cycles with one relaxation on each level.

Figure 9: Timing results. Above: time spent exchanging data
(change) and communicating with the host (read-write).

This case is particularly hard for a standard relaxation
approach. The image can be interpreted “legally”
in two possible ways: either as a concave or a convex
hemisphere.  Starting  from  random  initial  values,  some 
image  patches  will  “vote”  for  one  or  the  other  interpre¬ 
tation  and  try  to  extend  the  local  interpretation  to  a 
global  one.  This  not  only  takes  time  (given  the  local 
nature  of  the  updating  rule)  but  encounters  an  endless 
struggle  in  the  regions  that  mark  the  border  between 
different  interpretations.  The  multigrid  approach  solves 
this  “democratic  impasse”  on  the  coarsest  grids  (much 
faster  because  now  information  spreads  over  large  dis¬ 
tances)  and  propagates  this  decision  to  the  finer  grids, 
that  will  now  concentrate  their  efforts  on  refining  the 
initial  approximation. 

Another example is shown in Figure 11, where the
three-dimensional structure of the Mona Lisa face
painted by Leonardo⁸ is reconstructed.

⁸Anticipating the reader's unhappiness with her aesthetic appearance,
let's remember that the Lambertian reflectance model
is a very naive approximation of the artistic shading used by
Leonardo.

Figure 10: Reconstruction of shape from shading: standard
relaxation versus multigrid.

5  Results:  Surface  Reconstruc¬ 
tion  from  Depth  Constraints 

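The functional for the surface reconstruction problem
is:

$$E(z(x,y)) = \int_{\text{Image}} (z(x,y) - d(x,y))^2 + \lambda\,(z_x^2 + z_y^2)\, dx\, dy \qquad (8)$$

A physical analogy is that of fitting the data d(x, y)
with a membrane pulled by springs connected to them.
A given z value is updated as follows:

$$z(x,y) \leftarrow \frac{\dfrac{\lambda}{h^2}\, z_{sum} + d(x,y)}{\dfrac{\lambda}{h^2}\, n_{sum} + 1} \qquad (9)$$

where h = grid step,

$$z_{sum} = \sum_{dx = \pm h;\ dy = \pm h} \overline{LP}(x+dx, y+dy)\; z(x+dx, y+dy),$$

$$n_{sum} = \sum_{dx = \pm h;\ dy = \pm h} \overline{LP}(x+dx, y+dy),$$

and $\overline{LP} = 1 - LP$, so that only neighbors not separated by an
active line process contribute to the sums.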
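
A minimal C sketch of such an update is given below. It is obtained by setting to zero the derivative of a five-point discretization of the functional with respect to z at one grid point, which is an assumption about the discretization rather than the authors' code. The storage of the line processes in two arrays (lp_h between horizontal neighbors, lp_v between vertical neighbors) and the name relax_point are also assumptions.

```c
/* One Gauss-Seidel update of z at grid point (i, j) on a grid with w
   columns and `rows` rows, stored row by row.  lp_h[j*w+i] != 0 marks an
   active line process between (i, j) and (i+1, j); lp_v[j*w+i] != 0 marks
   one between (i, j) and (i, j+1).  A neighbor separated from the point
   by an active line process is left out of the sums. */
void relax_point(double *z, const double *d,
                 const unsigned char *lp_h, const unsigned char *lp_v,
                 int w, int rows, int i, int j, double lambda, double h)
{
    double k = lambda / (h * h);
    double z_sum = 0.0;
    int n_sum = 0;

    if (i > 0        && !lp_h[j * w + i - 1])   { z_sum += z[j * w + i - 1];   n_sum++; }
    if (i < w - 1    && !lp_h[j * w + i])       { z_sum += z[j * w + i + 1];   n_sum++; }
    if (j > 0        && !lp_v[(j - 1) * w + i]) { z_sum += z[(j - 1) * w + i]; n_sum++; }
    if (j < rows - 1 && !lp_v[j * w + i])       { z_sum += z[(j + 1) * w + i]; n_sum++; }

    /* Weighted mean of the datum and of the connected neighbors. */
    z[j * w + i] = (k * z_sum + d[j * w + i]) / (k * n_sum + 1.0);
}
```

When all four line processes around a point are active, n_sum is zero and the point simply takes the value of its datum, which is the expected behavior for an isolated point of the membrane.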


Figure  11:  Mona  Lisa  in  three  dimensions. 

Detailed  performance  tests  have  been  made  using 
noisy  data  for  z  values  corresponding  to  “Randomville” 
structures.  These  are  obtained  by  generating  random 
coordinates,  heights,  slants  and  tilts  for  quadrangular 
blocks and placing them in the image plane. The data
are  then  corrupted  by  noise  and  loaded  as  constraints 
in  the  algorithm. 

For  129  X  129  “images”  and  noise  values  correspond¬ 
ing  to  25%  of  the  highest  structure,  a  faithful  recon¬ 
struction  of  the  surface  (within  a  few  percent  of  the 
original one) is normally obtained after one single multiscale
sweep (with V cycles) on four layers⁹.

The  total  computational  time  corresponds  approxi¬ 
mately  to  the  time  required  by  3  relaxations  on  the 
finest  grid.  Because  of  the  optimality  of  multiscale 
methods,  time  increases  linearly  with  the  number  of 
image  pixels. 


The effect of active discontinuities (LP = 1) is that
of inhibiting the smoothing action at their location.

⁹In other words, parameters na, nb, nc in mg() are equal to
one.

User interface examples and results from some tests
are shown in the last figures. Figure 12 shows the
simulation environment on the SUN workstation; Figure 13
and Figure 14 show the reconstruction of a typical
“Randomville” image. The original surface, the
surface corrupted by noise (25%) and the reconstruction
on different scales are shown in this order.

Figure 12: Simulation environment.


6  Summary  and  Discussion 

The extension of multiscale methods to incorporate
discontinuity detection can be done in an effective way,
combining reconstruction and discontinuity detection
in time and scale.

This  reduces  total  computational  time  by  orders  of 
magnitude  with  respect  to  single  scale  methods  and 
provides  a  better  coordination  between  the  two  require¬ 
ments  of  faithful  reconstruction  and  good  discontinuity 
structure. 

The  algorithm  can  be  efficiently  executed  on  a  paral¬ 
lel  computer  and  a  two-dimensional  domain  decompo¬ 
sition  is  an  effective  approach. 
Abstract

The  problem  considered  in  this  work  is  that  of  estimat¬ 
ing  the  motion  field  (i.e.  the  projection  of  the  velocity 
field  onto  the  image  plane)  from  a  temporal  sequence 
of  images. 

Generic  images  contain  different  objects  with  diverse 
spatial  frequencies  and  motion  amplitudes.  To  deal 
with  this  complex  environment  in  a  fast  and  effective 
way,  biological  visual  systems  use  parallel  processing, 
visual  channels  at  different  resolutions  and  adaptive 
mechanisms.  In  this  paper  a  new  adaptive  multiscale 
scheme  is  proposed,  in  which  the  spatial  discretiza¬ 
tion  scale  is  based  on  a  local  estimate  of  the  errors  in¬ 
volved.  Considering  the  constraints  for  real-time  oper¬ 
ation,  flexibility  and  portability,  the  scheme  can  be  im¬ 
plemented  on  MIMD  parallel  computers  with  medium 
size  grains  with  high  efficiency. 

Tests  with  ray-traced  and  video-acquired  images  for 
different  motion  ranges  show  that  this  method  produces 
a  better  estimation  with  respect  to  the  homogeneous 
(non-adaptive)  multiscale  method. 


1  Introduction 

Many  low-  and  medium-level  computer  vision  problems 
can  be  formulated  in  the  context  of  partial  differential 
equations.  These  in  turn  are  transformed  into  large  al¬ 
gebraic  systems  (after  discretization)  that  can  be  solved 
using  iterative  relaxation  methods.  Homogeneous  mul¬ 
tiscale  techniques  have  been  proposed  as  a  way  to  accel¬ 
erate  the  convergence.  Finally  parallel  computation  is  a 
natural  way  to  further  reduce  the  solution  time  in  order 
to  obtain  real-time  or  close-to-real-time  performance. 

While  this  framework  is  now  widely  popular  in  com¬ 
puter  vision,  the  focus  of  this  work  is  on  an  adaptive 
modification  of  the  previous  scheme.  Adaptive  dis¬ 
cretization  is  not  introduced  to  reduce  the  computa¬ 
tional  burden  with  respect  to  the  homogeneous  multi¬ 
scale  strategy  but  to  produce  a  more  reliable  estimation 
of  the  motion  field.  The  adaptive  solution  grid  is  neces¬ 
sary  in  order  to  deal  with  the  errors  introduced  in  the 
definition  (not  in  the  solution)  of  the  algebraic  system. 


In standard reference texts on multiscale and multigrid
methods (see for example [4]) one assumes that the
coefficients  and  the  inhomogeneous  term  in  the  p.d.e. 
are  known  precisely  (at  the  finest  scale).  The  real  world 
situation  in  computer  vision  is  very  different.  Consid¬ 
ering  the  estimation  of  the  motion  field,  the  p.d.e.  co¬ 
efficients  depend  on  temporal  and  spatial  derivatives  of 
the  image  brightness  pattern.  Unfortunately  errors  are 
introduced  in  many  ways.  First  the  acquisition  process 
produces  quantization  errors  (due  to  the  finite  number 
of  gray  values)  and  possibly  random  noise.  In  addition 
the  discretized  derivative  estimation  formulas  are  valid 
when  the  step  size  can  be  considered  small  with  respect 
to  the  dominant  wavelengths  contained  in  the  Fourier 
transform  of  the  image  and  with  respect  to  the  move¬ 
ments  in  the  scene.  In  general,  evaluation  at  coarse 
spatial  scale  will  suffer  from  quantization  noise  (because 
the spatial derivatives will be small), while evaluation
at finer scale will tend to be unreliable if short wavelengths
are present. The appropriate scale for the definition
of the p.d.e. coefficients therefore depends on the
estimation  errors  involved  and  is  in  general  different  for 
the  different  parts  of  the  image. 

2  Reliable  Computation  of  the 
Motion  Field 

In  particular  situations  the  apparent  motion  of  the 
brightness  pattern,  known  as  the  optical  flow,  provides 
a  sufficiently  accurate  estimate  of  the  motion  field.  Al¬ 
though  the  adaptive  scheme  proposed  in  this  paper  is 
applicable  to  different  methods,  the  discussion  will  be 
based  on  the  scheme  proposed  by  Horn  and  Schunck 
[12]. They use the assumptions that the image brightness
of a given point remains constant over time, and
that  the  optical  flow  varies  smoothly  almost  every¬ 
where.  Satisfaction  of  these  two  constraints  is  formu¬ 
lated  as  the  problem  of  minimizing  a  quadratic  en¬ 
ergy  functional  (see  also  [17]).  The  appropriate  Euler- 
Lagrange  equations  are  then  discretized  on  a  single  or 
multiple grids and solved using, for example, the Gauss-Seidel
“relaxation” method. The reader interested in
the detailed derivation is referred to [12,21]. The resulting
system of equations (two for every pixel in the
image) is:


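$$(I_x u_{ij} + I_y v_{ij} + I_t)\, I_x = \frac{\alpha^2}{(\Delta x)^2}\,(\bar{u}_{ij} - u_{ij}) \qquad (1)$$

$$(I_x u_{ij} + I_y v_{ij} + I_t)\, I_y = \frac{\alpha^2}{(\Delta x)^2}\,(\bar{v}_{ij} - v_{ij}) \qquad (2)$$

where $u_{ij} = dx/dt$ and $v_{ij} = dy/dt$ are the optical flow
variables to be determined, $I_x$, $I_y$, $I_t$ the partial
derivatives of the image brightness with respect to
space and time, $\bar{u}$ and $\bar{v}$ are local averages, $\Delta x$ is the
spatial discretisation step, and $\alpha$ controls the smoothness
of the estimated optical flow.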

Now,  the  partial  derivatives  in  equations  1  and  2  need 
to  be  estimated  with  discretized  formulas  starting  from 
brightness  values  that  are  quantized  (say  integers  from 
0  to  n)  and  noisy.  It  is  easy  to  show  that,  given  these 
derivative  estimation  problems,  the  optimal  step  for  the 
discretisation  grid  depends  on  local  properties  of  the 
image.  Use  of  a  single  discretisation  step  produces  large 
errors  on  some  images.  Use  of  a  homogeneous  multiscale 
approach  where  a  set  of  grids  at  different  resolutions  is 
used,  may  in  some  cases  produce  a  good  estimation  on 
an  intermediate  grid  and  a  bad  one  on  the  final  and 
finest  grid.  Enkelmann  and  Glaser  [8,11]  encountered 
similar  problems. 

We  propose  a  method  for  “tuning”  the  discretisation 
grid  to  a  measure  of  the  reliability  of  the  optical  flow 
derived  at  a  given  scale.  This  measure  will  be  based 
on  a  local  estimate  of  the  errors  due  to  noise  and  dis¬ 
cretisation. 

The  rest  of  this  paper  is  organised  as  follows.  First 
we  discuss  some  fundamental  shortcomings  of  the  ho¬ 
mogeneous  multiscale  version  and  derive  a  formula  for 
the  error  in  derivative  estimation.  Next,  we  describe 
our  scheme  with  adaptive  discretisation  and  discuss 
the multiprocessor implementation. Finally, we present
some  experimental  results  obtained  with  some  image 
sequences. 

3  Errors  in  Derivative  Estima¬ 
tion 

The  difficulties  introduced  by  erroneous  derivative  es¬ 
timation  can  be  illustrated  with  the  following  one¬ 
dimensional  example.  Let’s  suppose  that  the  intensity 
pattern  observed  is  a  superposition  of  two  sine  waves 
of  different  wavelengths: 


I{x,t)  a  (1  +  ii  -f-  sin(-— (i  -  2t))  +  ii8in(-— (x  -  2t))) 
o  3 

(3) 

where R is the ratio of short to long wavelength components.
Using the brightness constancy assumption [12]
the measured velocity v is given by:

$$v = -\frac{I_t}{I_x} \qquad (4)$$

where $I_x$ and $I_t$ are the three-point approximations of
the spatial and temporal brightness derivatives¹.

Now, if one calculates the estimated velocity on two
different grids, with spatial step $\Delta x$ equal to 1 and 2,
as a function of the parameter R one obtains the result
illustrated in Figure 1. While on the coarser grid the
correct velocity is obtained (in this case), on the finer
one the measured velocity depends on the value of R.
In particular, if R is greater than 0.5 a velocity in the
opposite direction is obtained!

Figure 1: Measured velocity for superposition of sinusoidal
patterns as a function of the ratio of short to long wavelength
components. Dashed line: estimation with $\Delta x = 2$, continuous
line: estimation with $\Delta x = 1$. The correct velocity is equal to 2
($\Delta t = 1$).

¹That is $I_x = \frac{1}{2\Delta x}\,(I(x + \Delta x) - I(x - \Delta x))$ and analogously
for $I_t$.


For a general one-dimensional profile $f(x - vt)$ it is
easy to derive (using Taylor expansion) the following
approximation for the relative velocity error due to
discretization:

$$\frac{\delta v}{v} \approx \frac{f'''(y)}{6 f'(y)}\left((v\Delta t)^2 - (\Delta x)^2\right) \qquad (5)$$
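
The discretization error can be checked numerically with a small C sketch. To keep the arithmetic exact it uses a cubic profile f(u) = u^3 instead of sinusoids (an assumption made for this illustration), for which f'''/(6 f') = 1/(3 u^2). The leading-order error is taken in the form δv/v ≈ f'''/(6 f') ((vΔt)² − (Δx)²), and all function names are hypothetical.

```c
/* Brightness profile f(u) = u^3 translating with velocity v:
   I(x, t) = f(x - v t). */
double brightness(double x, double t, double v)
{
    double u = x - v * t;
    return u * u * u;
}

/* Velocity from the brightness constancy assumption, v = -I_t / I_x,
   using three-point (central difference) derivative estimates. */
double measured_velocity(double x, double t, double v, double dx, double dt)
{
    double i_x = (brightness(x + dx, t, v) - brightness(x - dx, t, v)) / (2.0 * dx);
    double i_t = (brightness(x, t + dt, v) - brightness(x, t - dt, v)) / (2.0 * dt);
    return -i_t / i_x;
}

/* Leading-order relative error for f(u) = u^3 at u = x - v t:
   f'''/(6 f') * ((v dt)^2 - dx^2), with f'''/(6 f') = 1/(3 u^2). */
double predicted_relative_error(double u, double v, double dx, double dt)
{
    return ((v * dt) * (v * dt) - dx * dx) / (3.0 * u * u);
}
```

With v = 2 and Δt = 1, the estimate at x = 1, t = 0 is exactly 2 when Δx = 2 (the spatial step equals the displacement vΔt, so the two error contributions cancel) but 3.5 when Δx = 1. This mirrors the observation that the finer grid is not necessarily the more reliable one.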


Considering  also  the  errors  introduced  by  quantization 
one  obtains  (after  substitutions  and  reasonable  assump¬ 
tions  described  in  [2]): 


*  u  p2n2  '  (A,/)2  (At7)2 

(6) 

where $\Delta_x I$ and $\Delta_t I$ are the spatial and temporal
differences in intensity values². These differences grow
linearly with the number of discretization levels n. Therefore,
while the first term of the overall relative error
does not depend on n, the second term, which expresses
the contribution due to the quantization process,
decreases with n and can be reduced by increasing the
number of quantization levels. C is assumed to be
a constant (with a heuristic value derived in
the case of sinusoidal patterns). Finally the parameter
p (fractional range of intensity values in a given
image) is needed in the case of over- or under-exposed
images. The two-dimensional estimate of the overall
relative error is obtained from eqn. 6 by rotational
invariance, substituting $(\Delta_x I)^2$ with $(\Delta_x I)^2 + (\Delta_y I)^2$.
This amounts to measuring the field unreliability
according to the error in the component of the velocity
that is normal to the brightness gradient.


4 Adaptive Multiscale Solution on a Multicomputer

The previous example and considerations suggest a new strategy.
First a Gaussian pyramid [5] is computed from the given images.
This consists of a hierarchy of images obtained by filtering the
original ones with Gaussian filters of progressively larger size**.

Then the optical flow field is computed at the coarsest scale using
relaxation, and the estimated error according to eqn. 6 is
calculated for every pixel. If this quantity is less than a given
threshold $T_{err}$, the current value of the flow is interpolated
to the finer resolutions without further processing. This is done
by setting an inhibition flag contained in the grid points of the
pyramidal structure, so that these points do not participate in the
relaxation process. If, on the contrary, the error is larger than
$T_{err}$, the approximation is relaxed on a finer scale and the
entire process is repeated until the finest scale is reached. A
local inhomogeneous approach is thus obtained, where areas of the
images characterized by different spatial frequencies or by
different motion amplitudes are processed at the appropriate
resolutions, avoiding corruption of good estimates by inconsistent
information from a different scale (the effect shown in the
previous example). The optimal grid structure for a given image is
translated into a pattern of active and inhibited grid points in
the pyramid, as illustrated in Figure 2.

*That is, $\Delta_x I = (I(x + \Delta x) - I(x - \Delta x))$, and
analogously for $\Delta_t I$.

**Three levels are used for 65x65 images.


Figure 2: Adaptive grid and activity pattern in the multiresolution
pyramid.
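The freeze-and-refine step can be sketched on a one-dimensional grid as follows. This is a minimal illustration under stated assumptions: the `relax` and `error` callables, the Jacobi-style sweep, and the rule that an interpolated point is frozen only when both coarse neighbours are frozen are hypothetical choices, not details taken from the original implementation.

```python
def refine_level(flow, inhibited, error, t_err, relax, steps=2):
    """One level of the adaptive coarse-to-fine pass on a 1-D grid.

    flow      : list of velocity estimates at this level
    inhibited : list of bools, True where an estimate is already frozen
    error     : function(index) -> estimated relative error at that point
    relax     : function(flow, index) -> relaxed value for an active point
    """
    for _ in range(steps):
        # Jacobi-style sweep: inhibited points keep their value.
        flow = [flow[i] if inhibited[i] else relax(flow, i)
                for i in range(len(flow))]
    # Freeze every point whose estimated error is below the threshold.
    inhibited = [inhibited[i] or error(i) < t_err for i in range(len(flow))]
    return flow, inhibited

def prolong(flow, inhibited):
    """Interpolate a level with n points to the next finer one (2n - 1 points)."""
    fine_flow, fine_inh = [], []
    for i in range(len(flow) - 1):
        fine_flow += [flow[i], 0.5 * (flow[i] + flow[i + 1])]
        # Assumed rule: an in-between point is frozen only if both
        # of its coarse neighbours are frozen.
        fine_inh += [inhibited[i], inhibited[i] and inhibited[i + 1]]
    fine_flow.append(flow[-1])
    fine_inh.append(inhibited[-1])
    return fine_flow, fine_inh
```

Calling `refine_level` and then `prolong` once per level, from the coarsest grid to the finest, gives the pattern of active and inhibited points described above.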

The motivation for freezing the motion field as soon as the error
is below threshold is that the estimation of the error may itself
become incorrect at finer scales and therefore useless in the
decision process. It is important to point out that single scale or
homogeneous approaches cannot adequately solve the above problem.
Intuitively, what happens in the adaptive multiscale approach is
that the velocity is frozen as soon as the spatial and temporal
differences at a given scale are big enough to avoid quantization
errors but small enough to avoid errors in the use of discretized
formulas. The only assumption made in this scheme is that the
largest motion in the scene can be reliably computed at one of the
resolutions used. If the images contain motion discontinuities,
line processes (indicating the presence of these discontinuities)
are necessary to prevent smoothing where it is not desired (see [1]
and the references therein).

The multiscale algorithm described in the previous section can
clearly be executed in different ways on a parallel computer (for a
pictorial representation see Figure 3).


Figure 3: Mapping between a (one-dimensional) multiscale structure
and hypercubes of different “grain sizes”.

Considering first an implementation on a SIMD parallel computer
with a large number of processors, the maximum amount of
parallelism is obtained by assigning one processor to each grid
point [6,13]. Unfortunately this scheme presents a serious
disadvantage: while relaxation is executed at a given level of the
pyramid, all processors assigned to different levels are inactive.
In a pyramid with L levels the hardware utilization efficiency will
therefore be 1/L. For example, if an image with 512x512 pixels is
analyzed at 6 different resolutions, the efficiency is at most 1/6,
or about 0.17.

If the pyramidal structure of grid points cannot be mapped in a
one-to-one manner onto the processing nodes of a given
architecture, an additional reduction in hardware utilization
efficiency is present. For example, if the implementation is on a
fine grain hypercube parallel computer and if the mapping is such
that one processor is assigned to each grid point, a fraction of
the nodes is left unassigned [6]. This stems from the fact that the
total number of grid points in the pyramid is not necessarily close
to a power of two. For the two-dimensional problem considered, if
the number of pixels at the finest resolution is $2^n$, the total
number of grid points in the complete pyramid is:

$$2^n \left( 1 + \frac{1}{4} + \left(\frac{1}{4}\right)^2 + \cdots + \left(\frac{1}{4}\right)^{n/2} \right) \approx \frac{4}{3}\, 2^n$$

Because an (n + 1)-dimensional hypercube (with $2^{n+1}$ nodes) is
needed, only 66% of the nodes will be assigned to grid points.
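The count of grid points and the resulting node utilization can be reproduced with a few lines of arithmetic. The sketch below assumes an even exponent n, so that the pyramid can be coarsened down to a single point; the function names are illustrative.

```python
def pyramid_points(n):
    """Total grid points in a complete pyramid whose finest level has 2**n pixels.

    Each coarser level of the two-dimensional pyramid has a quarter of the
    points of the level below it (n is assumed to be even).
    """
    levels = n // 2 + 1
    return sum(2 ** n // 4 ** level for level in range(levels))

def hypercube_utilization(n):
    """Fraction of hypercube nodes used with one node per grid point."""
    points = pyramid_points(n)
    dimension = (points - 1).bit_length()   # smallest d with 2**d >= points
    return points / 2 ** dimension

print(pyramid_points(18), round(hypercube_utilization(18), 3))
```

For a 512x512 image (n = 18) the pyramid needs a 19-dimensional hypercube, and about two thirds of its nodes hold a grid point.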

The efficiency of the parallel implementation of the multiscale
algorithm that assigns one processor to each pixel (or to each grid
point at the finest resolution) is furthermore limited by the fact
that when relaxation is executed at coarse resolutions many
processors are inactive. A detailed calculation [6] shows that the
efficiency decreases at the rate of $1/\log_2 I$ as the grid size
$I$ increases*. Considering the communication overhead, the optimal
mapping (using hierarchical Gray code [6,9]) is such that the
distance between neighboring points on the coarser grids is equal
to two, causing some delays, especially on old architectures with
direct communication limited to neighboring processors.

The  above  discussion  provides  some  motivation 
for  the  use  of  large  grain-size  multicomputers  (see 
also  [10,7]  and  [20]  for  a  general  discussion). 

In this case a simple two-dimensional domain decomposition can be
used efficiently: a slice of the image with its associated
pyramidal structure is assigned to each processor. When
implementing the adaptive strategy, more complex schemes with
dynamic load balancing are not needed, because a real-time scheme
is supposed to produce a solution in the given time in the worst
possible case, when all grid units are active (this situation may
correspond to images with fine details in all regions of the
scene). All nodes are working all the time, switching between
different levels of the pyramid as illustrated in Figure 4.

No modification of the sequential algorithm is needed for points in
the image belonging to the interior of the assigned domain. In
contrast, points on the domain boundary need to know the values of
points assigned to nearby processors. For this purpose the domain
assigned to each processor is extended with an overlap area, and a
communication step on a given layer is used before each iteration,
as described in [9,15,1]. Only two exchanges are necessary (one in
the north-south and the other in the east-west direction), as shown
in Figure 4.

*This result is therefore similar to that obtained in the scheme
that assigns one node to each grid point, because
$\log_2 I \approx L$, the number of levels.


Figure 4: Domain decomposition for multiscale computation.
Processor communication is on a two-dimensional grid; each
processor operates at all levels of the pyramid.

As will be shown in the examples, the use of limited coarsening
[19] is allowed for practical problems in computer vision. In this
mapping scheme the coarsest level of the pyramid contains a number
of pixels such that each processor will contain at least four of
them. In this case the programming environment is more convenient,
because no exceptions are needed in the relaxation and
communication routines, and the efficiency is optimal, because no
processor is idle at the coarsest grids. Eliminating the coarsest
grids in the complete pyramid does not affect the solution time in
a significant way on the problems considered.

5 A Model of the Architecture and Algorithm

The efficiency of the parallel implementation of a multiscale
algorithm depends on characteristics of the algorithm and on the
performance of the hardware of a given multicomputer. The following
discussion is limited to multicomputers with a two-dimensional grid
of connections. We will distinguish parameters related to the
image, multiscale pyramid, processors and communications,
algorithm, and mapping, as follows:

• IMAGE

$I$  Image dimension at the finest resolution (the number of
pixels is $I \times I$), where $I = 2^k + 1$.

$I_{min}$  Image dimension at the coarsest resolution considered
in the limited coarsening scheme.

• PYRAMID

$L$  Number of levels in the complete pyramid. It is easy to
derive the relation $L = \log_2(I - 1) + 1$. The levels are
numbered according to: $l = 0$ (coarsest), $1, \ldots, L - 1$
(finest).

$l_{coarsest}$  Level number corresponding to the coarsest
resolution used in a given implementation (equal to
$\log_2(I_{min} - 1)$). $l_{finest}$ is always $L - 1$.

$n_{levels}$  Number of levels used, equal to
$l_{finest} - l_{coarsest} + 1$.

• PROCESSORS AND COMMUNICATIONS

$t_{calc\_f}$  Time for one floating point operation.

$t_{calc\_i}$  Time for one integer operation.

$t_{comm\_s}$  Startup time for message transmission.

$t_{comm\_t}$  Transfer time for transmitting the status of one
pixel with the associated line processes (in the present scheme two
line processes are associated with each pixel, for example the ones
to the north and to the east).

• ALGORITHM

$W_{relax}$  Number of operations per pixel during one relaxation
step.

$W_{disc}$  Number of operations per pixel for updating the line
processes.

$W_{init}$  Number of operations per pixel for initialization.

$a$  Number of relaxations executed on each level.

• MAPPING

$P$  Number of processors in the multicomputer. The use of limited
coarsening is allowed provided that the following relation is
valid: $I_{min} \geq 2 \sqrt{P}$ (each processor must contain at
least 4 pixels at the coarsest resolution).

$E$  Number of data exchanges per relaxation step.

$T$  Thickness of the overlap area (see previous section).

$T_{calc}$  Total calculation time.

$T_{start}$  Total startup time for communication.

$T_{trans}$  Total transmission time for communication.

For the performance analysis we follow the model introduced in [9].
The efficiency, or speedup per node, is defined by:

$$\epsilon = \frac{T_{seq}}{P \, T_{conc}(P)} \tag{7}$$

where $T_{seq}$ and $T_{conc}$ are the solution times on the
sequential and parallel computer.

In order to calculate the computation and communication times, it
is useful to introduce the function $\mathcal{W}(n_{levels}, \gamma)$
measuring the number of work units required by a given multiscale
scheme, where one work unit is defined as the amount of computation
required by one relaxation at the finest grid (4). This function is
defined as:

$$\mathcal{W}(n_{levels}, \gamma) = \sum_{l=1}^{n_{levels}} 2^{-\gamma (l - 1)} \tag{8}$$

where $\gamma$ is the "dimension" of the problem (for example
$\gamma = 2$ for operations on the two-dimensional image array,
$\gamma = 1$ for operations on the boundaries).

The complete calculation is composed of an initialization part,
where preliminary data are calculated on each layer, and of the
multiscale coarse-to-fine scheme. Assuming the use of limited
coarsening, the resulting expression for the total computation time
is:

$$T_{calc} = \frac{I^2}{P} \left[ W_{init}\, t_{calc\_f}\, \mathcal{W}(n_{levels}, 2) + a \left( W_{relax}\, t_{calc\_f} + W_{disc}\, t_{calc\_i} \right) \mathcal{W}(n_{levels}, 2) \right] \tag{9}$$

The startup time is the same for each level, therefore the total is
given by:

$$T_{start} = E \, t_{comm\_s} \, a \, n_{levels} \tag{10}$$

while the transmission time is

$$T_{trans} = \frac{I}{\sqrt{P}} \, T \, E \, a \, t_{comm\_t} \, \mathcal{W}(n_{levels}, 1) \tag{11}$$

6 Multicomputer Implementation

The hardware parameters for some commercial MIMD computers are
collected in Table 1. Data about communication are for message
lengths greater than 1K bytes; $t_{comm\_byte}$ is the transfer
time per byte. These data have been collected from different
sources [3,18], using different operating systems*, and are not
exhaustive of the available multicomputers.

Machine         t_comm_s    t_comm_byte
Meiko           288         1.19
Symult S2010    917         0.68

Table 1: Latency, communication time and floating point performance
of some multicomputers. Times are in μs.


In the following we present the theoretical speedup that can be
expected for machines based on the Transputer. For our problem
$W_{init} = 30$, $W_{relax} = 25$, and $W_{disc} = 16$. Two
relaxations on each level are assumed ($a = 2$). The minimum image
size considered for a fixed number of nodes is such that at least
four pixels are assigned to each processor. The number of levels
used in equations 9-11 is the maximum compatible with limited
coarsening, as described in the previous section. In addition, the
number of bytes per pixel to transfer is 6, and therefore
$t_{comm\_t} = 6\, t_{comm\_byte}$. The thickness of the overlap
area and the number of exchanges per iteration are T = 1 and E = 2,
respectively.
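A minimal sketch of how such an efficiency estimate could be computed is given below. It rests on several assumptions that should be read as such: computation time is taken proportional to $I^2/P$ times the work units, startup time to $E \cdot t_{comm\_s} \cdot a \cdot n_{levels}$, and transmission time to the boundary length $I/\sqrt{P}$ times $T \cdot E \cdot a \cdot t_{comm\_t}$ times the one-dimensional work units. The hardware timings in `HW` are hypothetical placeholders rather than measured values, and the default `w_disc=16` is an assumed reading of the operation count for the line processes.

```python
import math

def work_units(n_levels, gamma):
    """Work units for one relaxation on every level (finest grid = 1 unit)."""
    return sum(2.0 ** (-gamma * level) for level in range(n_levels))

def max_levels(image_size, n_proc):
    """Largest number of levels compatible with limited coarsening.

    image_size is I = 2**k + 1; the coarsest grid must satisfy
    I_min >= 2 * sqrt(P), i.e. at least four pixels per processor.
    """
    l_finest = int(math.log2(image_size - 1))
    l_coarsest = 0
    while 2 ** l_coarsest + 1 < 2.0 * math.sqrt(n_proc):
        l_coarsest += 1
    return l_finest - l_coarsest + 1

def efficiency(image_size, n_proc, hw, w_init=30, w_relax=25, w_disc=16,
               n_relax=2, thickness=1, exchanges=2, bytes_per_pixel=6):
    """Speedup per node under the assumed timing model described above."""
    n_levels = max_levels(image_size, n_proc)
    per_pixel = (w_init * hw["t_calc_f"]
                 + n_relax * (w_relax * hw["t_calc_f"] + w_disc * hw["t_calc_i"]))
    t_seq = image_size ** 2 * per_pixel * work_units(n_levels, 2)
    t_calc = t_seq / n_proc
    t_start = exchanges * hw["t_comm_s"] * n_relax * n_levels
    t_comm_t = bytes_per_pixel * hw["t_comm_byte"]
    t_trans = (image_size / math.sqrt(n_proc) * thickness * exchanges
               * n_relax * t_comm_t * work_units(n_levels, 1))
    return t_seq / (n_proc * (t_calc + t_start + t_trans))

# Hypothetical hardware timings in microseconds (placeholders only).
HW = {"t_calc_f": 1.0, "t_calc_i": 0.5, "t_comm_s": 300.0, "t_comm_byte": 1.0}

for size in (129, 513):
    print(size, round(efficiency(size, 64, HW), 3))
```

With any such parameter set the qualitative behaviour is the same: efficiency grows with the image size for a fixed number of processors and drops as processors are added for a fixed image size, because the startup term does not shrink with the local domain.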

Figure 5 shows the theoretical efficiency calculated for the cited
machine as a function of the image size I, using the parameters
defined in Table 1. As can be seen, the efficiency for large images
(I > 512) is greater than 80% if the number of processors is less
than approximately 256. A lower bound** to the total solution time
for the different configurations is illustrated in Figure 6. From
the graphs it is clear that real-time multiscale processing is
within the reach of digital multiprocessor technology. The time for
loading the image to the different processors and for obtaining the
final results has not been considered and can be a serious limiting
factor for two-dimensional architectures (for a review of different
image processing architectures see [16,22]). The total processing
time can be reduced if overlap of communication and calculation is
allowed (see for example [3]). In this case relaxation could be
executed first for the grid points on the boundary. Values
pertaining to these points could then be exchanged while the
interior points are updated.

*In particular Express from Parasoft has been used on NCUBE/1,
Symult S2010, Mark III and Meiko.

**Inefficiencies caused by the use of high-level languages and
additional overheads caused by the operating systems are not taken
into account.


Figure 5: Theoretical speedup for the implementation on the
Transputers as a function of the image size. Different curves show
the efficiency for different numbers of processing nodes.

A program implementing the multiscale scheme has been written in C
and tested on a commercial parallel processor*. Because the
communication part is limited to the exchange step repeated before
each iteration, the software is easily portable to different
multiprocessors. The purpose of this implementation has been to
obtain some concrete experience, not to benchmark different
machines. In particular, the portability implied by the use of a
high-level language and a user-friendly communication environment**
has been obtained at the price of slowing down the theoretical
performance by a factor of two to four, depending on implementation
details.

Preliminary timing has been done using a board with eight
processors*** and test images of 129x129 pixels. The total solution
time is of the order of one to three seconds, depending on
available memory, compiler and operating system version. The
obtained efficiency is ε = 0.87.

*A Transputer board hosted by a Sun workstation.

**Provided by the Express software package from Parasoft.

***Definicom board with Transputers, software from Parasoft.


Figure 6: Total solution time for different configurations.

7  Test:  Expanding  Sphere 

These examples show that relaxation per se in the homogeneous
scheme does not reduce the solution error in all cases. In some
cases too many relaxation steps may increase the error, either
because of smoothing over regions with rapidly varying velocity
fields or because of propagation of constraints referring to
different moving objects (if appropriate line processes are not
activated).

The first set of images consists of ray-traced expanding spheres
superimposed onto a fixed "natural" background*. These images
contain a unique dominant spatial frequency of the order of
magnitude given by the sphere diameter. If we do not consider the
effect of quantization and assume that the motion amplitude is very
small with respect to the radius, one iteration is sufficient to
recover the correct optical flow (this is the special case of a
velocity vector parallel to the brightness gradient). The function
of relaxation is, in this case, to provide a better estimate by
averaging noisy derivative estimations on neighborhoods with a size
that increases with the number of relaxations applied.

*The background is used in order to obtain non-zero derivatives in
this region. In fact, if they vanish, all motion fields minimize
the Horn and Schunck functional.


Unfortunately, this is true only if one assumes that the occluding
boundary is known a priori and if the correct velocity is given on
this boundary, as is the case in Terzopoulos' work [21]. In the
general case (no initial information) different results are
possible. As will be shown in the following tests, the r.m.s. error
increases for small spheres (because noisy information on the
boundary is propagated in both directions), while for larger
spheres it first decreases (because of the averaging effect) but
after a few iterations increases (an average over very large
neighborhoods becomes worse than the original estimate) with a
speed proportional to the parameter a in the cited equations.

The graphs in Figure 7 show the behavior of the r.m.s. error as a
function of work units, for two different values of the sphere
radius (55 and 95 pixels). Movement is an expansion (3 pixels per
frame on the border of the sphere). Both the single scale and the
multiscale algorithms are tested.
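For reference, an r.m.s. error between a computed flow field and the known synthetic motion could be evaluated as in the sketch below. The exact definition used for the graphs is not spelled out here, so the formula (root of the mean squared length of the vector difference) is an assumption, as are the function and argument names.

```python
import math

def rms_error(estimated, reference):
    """Root-mean-square length of the difference between two flow fields.

    Both arguments are equal-length sequences of (u, v) velocity pairs,
    one pair per pixel.
    """
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("flow fields must be non-empty and of equal length")
    total = sum((ue - ur) ** 2 + (ve - vr) ** 2
                for (ue, ve), (ur, vr) in zip(estimated, reference))
    return math.sqrt(total / len(estimated))
```

Evaluating this quantity after each relaxation, and plotting it against the accumulated work units, gives curves of the kind shown in Figure 7.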

For the smaller sphere, single scale relaxations make the error
worse. The multiscale algorithm does not improve the result. The
r.m.s. error as a function of work units is not monotonic (see
graph), and the last fine scale iterations show an increasing
error. The effect of the boundary is present in particular at the
coarsest scale because the ratio of boundary to internal points is
large.

For the larger sphere (the sphere boundary is now outside the
visible window of 129x129 pixels) the situation is different.
Single scale relaxations improve the r.m.s. error at the beginning.
The error reaches its minimum when 4 work units are completed, then
it increases. In this case the multiscale approach reduces the
error faster (the minimum is reached after 1.06 work units). But
the minimum value is reached on the middle scale and the error
becomes larger on the finest scale, another case in favor of the
adaptive approach. The following figure shows the optical field
obtained with the multiscale algorithm in the two cases. These
examples show that the effect of the boundary conditions on the
result is indeed an important one. Going from an exact a priori
knowledge of the occluding boundaries with their velocity values to
a situation where the only boundary conditions are the "free"
boundary conditions at the border of the image leads in general to
very different results*. If an exact knowledge of the occluding
boundary information is missing, incorporation of the boundary
detection step described in [1] is essential in order to avoid
smoothing across regions corresponding to different moving objects,
as will be shown in the next section.

*Lee et al. [14] show that free boundary conditions make the
problem ill-conditioned. It is essential to stress that this occurs
when the spatial derivatives $I_x$ and $I_y$ are very small,
precisely the case that has to be avoided because it is plagued by
large errors. If the derivatives are not small, ill-conditioning is
not a problem. Dirichlet boundary conditions, which are apparently
suggested in the cited reference, assume precise knowledge of the
velocity field on the boundary, and this is not the typical case in
computer vision.


Figure 7: Expanding sphere: r.m.s. error as a function of the
amount of computation in the multiscale scheme. Fig. (a): small
sphere (radius=55). Fig. (b): larger sphere (radius=95). Single
scale results (o) and multiple scale results (□). Interpolation to
finer scales temporarily increases the r.m.s. error. The algorithm
is terminated after the given number of work units because the
r.m.s. error is increasing.

7.1  Occluding  Objects 

This test compares the results obtained with and without
discontinuity detection. It shows that the optical flow near an
occluding boundary may be reconstructed with large errors unless
the smoothing process is blocked by line processes.

The images contain two spheres of different sizes (radii are 35 and
24 pixels), translating with velocities (0.0, -1.0) and (0.8, 0.2)
against a natural background. Their reflectance patterns are
sinusoidal grids (L is 13.3) of a different intensity range mapped
onto them using polar projection (in order to obtain a wide range
of Fourier components in the different regions of the spheres),
while illumination is coming from a source at infinity orthogonal
to the image plane. The parameters for the discontinuity detection
process are described in [1].


Figure 8: Multiscale optical flow for spheres of different sizes.
The effect of the sphere boundary on the result is visible for the
smaller sphere.

Figure 9 illustrates the optical flow obtained with the adaptive
multiscale process, using 6 relaxations on each of three scales.
The first image shows the result obtained without discontinuity
detection, while the second one shows the result when a
discontinuity detection step has been done every 2 relaxations.

The qualitative results are confirmed by considering the r.m.s.
error in the optical field for the two cases [2].


8  Test:  Natural  Image 

The images used for this test show a pine cone moving in the upward
direction. They were acquired with an S-VHS video camera and a
Targa frame grabber. Movement was executed by adjusting a tripod
sustaining the object by 0.25 cm every frame. Measured velocity is
1.6 pixels per frame. Tests have been done for sets of three images
taken every one, two, and three frames. The average velocity (on a
window centered on the pine cone) obtained with the homogeneous
multiscale algorithm is compared with that obtained with the
adaptive version. While this second version always produces a
better estimate, the difference is particularly significant for
large motion amplitudes, as shown in Figure 10.

In this case the fine scale derivative information is completely
erroneous. This is recognized by the adaptive scheme that freezes
the solution obtained at coarser grids.


Figure 9: Occluding moving spheres: optical flow obtained without
(a), and with the concurrent discontinuity detection process (b).


Figure 10: Test images, motion field at different scales, and graph
of average velocity. Velocities obtained with the homogeneous (□)
and adaptive (o) methods are compared with the correct velocity
(dashed line).

Abstract


Two different approaches to realistic image synthesis are ray
tracing and radiosity. Each method falls short in its attempt to
model the global illumination present in most environments. A more
general model includes the specular and diffuse reflection of both
these methods, but the combination requires prohibitive
computation.

The algorithm presented here extends the standard ray tracing
algorithm by adding diffuse rays, while eliminating unnecessary
radiosity calculations. The data separation used here is based on
the heuristic that light rays, as well as shadow rays, intersect
objects which are nearby more often than those that are more
distant. Although that heuristic seems sound, its accuracy is being
tested by monitoring the results of processing a large number of
different scenes of various complexity.

Keywords: Radiosity, Ray tracing, Parallel graphics algorithms

Introduction 

Early attempts at producing realistic computer generated images
used shading methods to simulate effects such as shadows, specular
reflections and transparency. Development of more accurate image
synthesis techniques essentially focused on two distinct
methodologies, ray tracing and radiosity. Ray tracing is a view
dependent approach to image synthesis which can accurately account
for the effects of shadowing and reflections from neighboring
surfaces. The method assumes that all light is specularly reflected
or transmitted and, in most cases, ignores any diffuse reflection.
Radiosity, on the other hand, attempts to account for a phenomenon
that ray tracing ignores, the diffuse inter-reflections of light
from surfaces in the scene, which may, in fact, provide most of the
illumination. Radiosity methods assume that all surfaces are
diffuse reflectors and make no allowances for specular reflection.

Recent research has concentrated on different ways of combining
these two effects to create an even more accurate model of global
illumination in an environment. Some combination methods add rays
of diffusely reflected light to a ray tracing program, while
reducing the number of additional rays. Kajiya describes a method
[8] by which the number of rays that need to be traced can be
reduced by stochastic sampling of the rays in the most important
directions. Ward et al. [13] use a Monte Carlo technique to
generate contribution values of indirect illuminance at certain
strategic points, and the values are averaged to provide values at
other points.

An alternative to including diffuse reflection in a ray tracing
program is to add a specular component to the radiosity method.
This approach was tried by Immel [7] using a very large system of
equations. Wallace [11] used a two-pass approach including a first
pass to compute the diffuse component and a second pass which is
essentially ray tracing. The weakness of all of these methods is
still the prohibitive amount of computation. In addition, in the
approach suggested by Immel [7] there is a strong dependence
between the total number of equations that need to be solved and
the specular reflectance of the surfaces in the scene. This
dependence eliminates the essential strength of the radiosity
approach, independence of the number of equations from the surface
characteristics.

The algorithm presented here extends the standard ray tracing
algorithm by adding diffuse rays, while eliminating unnecessary
radiosity calculations. The technique is based on results obtained
by Cohen et al. [3], who adapted the radiosity method to provide
for faster image generation in an animation environment. Cohen's
results suggest that light emitters and primary reflectors probably
produce most of the diffuse illumination in a scene. It is possible
to differentiate between specularly and diffusely reflected rays by
adjusting the level of recursion of a diffusely reflected ray so
that it will not continue to be propagated. These results suggest a
way of adding a diffuse component to a ray tracing program without
making it intractable. The unavoidable increase in computation can
be handled effectively by the use of parallel processing.
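The recursion-level idea can be illustrated with a toy tracer. The sketch below is an assumption-laden illustration, not the algorithm of this paper: the scene is a hypothetical dictionary mapping a ray identifier to the surface it hits and to the secondary rays it spawns, and `MAX_DEPTH`, `Hit` and `trace` are invented names. Specular rays recurse normally, while diffuse rays are started at the last allowed level so that the surfaces they reach spawn no further rays.

```python
from dataclasses import dataclass

MAX_DEPTH = 4  # hypothetical recursion limit for specular rays

@dataclass
class Hit:
    emitted: float    # light emitted by the surface that was hit
    specular: float   # specular reflectance in [0, 1]
    diffuse: float    # diffuse reflectance in [0, 1]

def trace(scene, ray_id, depth=0):
    """Return the light carried back along a ray.

    `scene` maps a ray id to (hit, specular_ray_id, diffuse_ray_ids);
    rays that are not in the scene hit nothing and return 0.
    """
    if depth > MAX_DEPTH or ray_id not in scene:
        return 0.0
    hit, spec_ray, diff_rays = scene[ray_id]
    value = hit.emitted
    # Specular bounce: ordinary recursion, one level deeper.
    value += hit.specular * trace(scene, spec_ray, depth + 1)
    # Diffuse bounces: start them at the last allowed level, so the
    # surfaces they reach contribute light but spawn nothing further.
    if diff_rays and depth < MAX_DEPTH:
        gathered = sum(trace(scene, r, MAX_DEPTH) for r in diff_rays)
        value += hit.diffuse * gathered / len(diff_rays)
    return value
```

Because each diffuse ray terminates after a single hit, the number of rays grows only linearly with the number of diffuse samples per surface instead of exponentially with the recursion depth.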

Ray  Tracing 

Ray tracing programs for the hypercube architecture were previously
implemented successfully by Salmon and Goldsmith [9] and Benner
[2]. One of the most time consuming tasks of a ray tracing program
is the computation of the points at which a ray intersects objects
in the scene. One technique that is used to avoid unnecessary
intersection calculations is to surround each object in the scene
with a tightly fitting, geometrically simple volume. Kay and Kajiya
[8] suggested enclosing sets of bounding volumes in still larger
bounding volumes so that intersection with large parts of the scene
can be ruled out by a simple intersection test with a high level
bounding volume.
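A hierarchy of bounding volumes can be sketched as a small tree walk. The example below uses bounding spheres purely because the intersection test is short; this choice, the `Node` layout and the function names are assumptions made for illustration and do not reproduce the volumes used in the cited work.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec = Tuple[float, float, float]

@dataclass
class Node:
    center: Vec                      # bounding-sphere centre
    radius: float                    # bounding-sphere radius
    children: List["Node"] = field(default_factory=list)
    obj: Optional[str] = None        # leaf payload (hypothetical object name)

def ray_hits_sphere(origin: Vec, direction: Vec, center: Vec, radius: float) -> bool:
    """True if the ray origin + s * direction (s >= 0, unit direction) meets the sphere."""
    oc = [c - o for c, o in zip(center, origin)]
    oc2 = sum(a * a for a in oc)
    if oc2 <= radius * radius:
        return True                                  # origin lies inside the sphere
    s = sum(a * b for a, b in zip(oc, direction))    # projection of centre on the ray
    return s >= 0.0 and oc2 - s * s <= radius * radius

def candidates(node: Node, origin: Vec, direction: Vec) -> List[str]:
    """Collect the objects whose bounding volumes the ray may intersect."""
    if not ray_hits_sphere(origin, direction, node.center, node.radius):
        return []                                    # whole subtree ruled out by one test
    if node.obj is not None:
        return [node.obj]
    found: List[str] = []
    for child in node.children:
        found += candidates(child, origin, direction)
    return found
```

Only the objects returned by `candidates` need the expensive exact intersection test; a miss on a high-level node discards all the objects beneath it at once.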

Radiosity 

In most environments there may also be illumination present that
cannot be directly defined as having originated from a point
source, as is usually the assumption made for ray tracing analysis
of a scene. Radiosity techniques attempt to determine precisely how
diffuse surfaces act as indirect light sources. This process is
based on methods used in thermal engineering to determine the
exchange of radiant energy between surfaces. In order to calculate
the amount of light energy arriving at a surface, a hypothetical
enclosure is constructed consisting of surfaces that completely
define the illuminating environment. Inside this enclosure there
must be an equilibrium energy balance.

The  early  work  on  radiosity  assumed  that  all  sur¬ 
faces  of  the  enclosure  were  ideal  diffuse  reflectors,  ideal 
diffuse light emitters or a combination of the two[4]. If
the entire environment, including the enclosure, is subdivided
into small enough "patches", then each patch
can  be  considered  to  be  of  uniform  composition,  with 
uniform  illumination,  reflection  and  emission  intensi¬ 
ties  over  the  surface.  The  total  radiant  energy  leaving 
a  surface  (its  radiosity)  consists  of  two  components  - 
emitted  and  reflected  radiation.  The  radiosity  of  one 
patch  can  be  expressed  by  the  equation: 

$$B_j = E_j + \rho_j H_j \qquad (1)$$

where:

$B_j$ = radiosity of patch j: the total rate of radiant
energy leaving the surface, in terms of energy
per unit time and per unit area ($W/m^2$).

$E_j$ = rate of direct energy emission from
patch j per unit time per unit area.

$\rho_j$ = reflectivity of patch j: fraction of
incident light reflected back into enclosure.

$H_j$ = incident radiant energy arriving at
patch j per unit time per unit area.

The  reflected  light  is  equal  to  the  light  leaving  ev¬ 
ery  other  surface  multiplied  by  both  the  fraction  of 
that  light  which  reaches  the  surface  in  question,  and 
the  reflectivity  of  the  receiving  surface.  Thus  Hj,  the 
incident  flux  on  patch  j,  is  the  sum  of  fluxes  from  all 
surfaces in the enclosure that "see" j (it may be that
patch j "sees" itself if it is concave). This means that

$$H_j = \sum_{i=1}^{N} B_i F_{ij} \qquad (2)$$

where:

$B_i$ = radiosity of patch i ($W/m^2$).

$F_{ij}$ = form factor: fraction of radiant energy
leaving patch i, impinging on patch j.

These  equations  can  be  combined  to  give: 

$$B_j = E_j + \rho_j \sum_{i=1}^{N} B_i F_{ij} \qquad (3)$$

where N is the number of surfaces in the enclosure.
Such an equation exists for each of the N patches in
the enclosure, yielding a set of N simultaneous equations.
Each of the parameters $E_j$, $\rho_j$, and $F_{ij}$ must
be known or calculated for each patch. The $E_j$ are
nonzero only at surfaces that provide illumination to
the enclosure, such as a diffuse illumination panel, or
the first reflection of a directional light source from
a diffuse surface. The number of patches does not
depend on viewpoint or resolution and is, therefore,
usually less than the number of pixels in the resulting
image. However, there are no provisions for a specular
reflection component in the calculation, since this
component is view dependent.
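
The set of N simultaneous equations can be solved iteratively. The following is a minimal sketch, not the implementation used in this work: it applies Gauss-Seidel sweeps to the radiosity equation, with `F[i][j]` standing for the form factor from patch i to patch j. The function name `solve_radiosity`, the fixed sweep count, and the two-patch data are illustrative assumptions, and convergence is assumed (for example, all reflectivities below 1).

```python
def solve_radiosity(E, rho, F, sweeps=100):
    """Gauss-Seidel solve of B_j = E_j + rho_j * sum_i B_i * F[i][j]."""
    n = len(E)
    B = list(E)  # initial guess: emission only
    for _ in range(sweeps):
        for j in range(n):
            # Incident flux on patch j from every patch that "sees" it.
            H_j = sum(B[i] * F[i][j] for i in range(n))
            # Radiosity = emitted + reflected radiation.
            B[j] = E[j] + rho[j] * H_j
    return B


# Hypothetical enclosure of two facing patches: patch 0 emits, patch 1 reflects.
E = [1.0, 0.0]
rho = [0.5, 0.5]
F = [[0.0, 1.0],
     [1.0, 0.0]]
B = solve_radiosity(E, rho, F)
```

For this toy enclosure the exact solution is B = (4/3, 2/3), which the sweeps approach geometrically.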

Ray  Tracing  with  a  Diffuse  Component 

Light  that  is  diffusely  reflected  into  the  eye  may  have 
arrived  at  the  reflecting  surface  from  any  direction.  A 
complete  ray  tracing  solution  would  require  that  a  ray 
be  traced  in  each  one  of  these  directions  and  propa¬ 
gated  if  necessary.  However,  if  the  light  arriving  from 
a  particular  direction  is  not  coming  from  a  light  source, 
it  must  be  the  result  of  emission  or  reflection  from  an¬ 
other  surface.  If  this  surface  is  to  contribute  illumi¬ 
nation  of  any  significance,  it  will  probably  be  either  a 
light  emitter  or  a  primary  reflector.  It  is  unlikely  that 
it  will  be  diffusely  reflecting  light  that  it  has  received 
from  another  surface.  This  means  that  the  rays  from 
the  original  point  of  intersection  need  to  be  traced  only 
in  those  directions  in  which  another  surface  is  visible 
and,  for  all  except  the  mirror  direction,  should  not  be 
propagated  beyond  the  point  of  intersection  with  that 
surface. 

In a regular ray tracing program each ray that is
specularly reflected or transmitted is traced recursively
until either the level of recursion reaches a predetermined
depth, or the contribution of the ray is determined
to be below a specified minimum. It is possible to
differentiate between specularly and diffusely reflected
rays by adjusting the level of recursion of a diffusely
reflected ray so that it will not continue to be propagated.


In  radiosity  methods  a  hemi-cube  or  full-cube  is 
used to determine which other surfaces can be "seen"
from  a  given  point.  The  same  approach  is  used  here. 
The  geometry  of  two  patches  with  the  hemi-cube  used 
to  determine  the  fraction  of  diffusely  reflected  light 
reaching  each  other  is  shown  in  Figure  1.  This  cube 
will  henceforth  be  referred  to  as  the  direction  cube. 
Each  cell  on  the  surface  of  the  direction  cube  repre¬ 
sents  a  direction  from  which  light  can  reach  the  point 
at  its  center. 
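
As an illustration of how the cells of a direction cube map to directions, the sketch below enumerates unit vectors through the cell centres of a hemi-cube. It assumes a local frame in which the surface normal is +z, a cube spanning [-1, 1] in x and y and [0, 1] in z, and a hypothetical resolution parameter `res`; none of these details are taken from the implementation described here.

```python
import math


def hemicube_directions(res):
    """Return unit vectors through the cell centres of a hemi-cube.

    `res` is the number of cells along a full edge (assumed even), so the
    top face has res*res cells and each half-height side face has
    res*(res//2) cells.
    """
    step = 2.0 / res
    centres = [-1.0 + (k + 0.5) * step for k in range(res)]
    z_centres = [(k + 0.5) * step for k in range(res // 2)]

    cells = [(x, y, 1.0) for x in centres for y in centres]      # top face
    for s in (-1.0, 1.0):
        cells += [(s, y, z) for y in centres for z in z_centres]  # x = -1, +1
        cells += [(x, s, z) for x in centres for z in z_centres]  # y = -1, +1

    dirs = []
    for x, y, z in cells:
        length = math.sqrt(x * x + y * y + z * z)
        dirs.append((x / length, y / length, z / length))
    return dirs


dirs = hemicube_directions(4)
```

With `res = 4` this yields 3 * 4 * 4 = 48 directions, all on the normal's side of the surface.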

Oftentimes,  the  simplest  way  to  determine  whether 
an  object  is  visible  in  the  direction  represented  by  a 
particular  cell  is  to  project  the  object  onto  the  face  of 
the  cube  that  contains  that  particular  cell.  If  the  cell 
is  contained  in  the  projection,  then  the  object  must  be 
visible  in  that  direction.  This  is  the  technique  that  is 
used  in  most  radiosity  programs,  which  divide  the  en¬ 
vironment  into  planar  patches  that  are  relatively  easy 
to  project.  However,  if  an  actual  point  of  intersection 
with  the  visible  object  is  also  required,  a  more  effi¬ 
cient  method  for  computation  of  both  visibility  and 
intersection  point  can  be  found  in  the  various  fast  ray 
tracing  intersection  techniques  that  have  been  devel¬ 
oped.  The  fastest  algorithm  available  is  that  devel¬ 
oped  by  Kay  and  Kajiya[8]  and  is  the  one  used  in  our 
combined  algorithm. 

The  more  accurate  lighting  model  which  accounts 
for  both  ray  tracing  and  radiosity  is  expressed  as  shown 
in  Equation  (4).  Optimization  of  the  integral  calcula¬ 
tion  is  accomplished  using  the  recent  results  for  inter¬ 
surface  projections  [1],[10],[12]. 

$$I_{out}(\Theta_{out}) = E(\Theta_{out}) + \int_{\Omega} \rho''(\Theta_{out}, \Theta_{in})\, I_{in}(\Theta_{in}) \cos(\theta)\, d\omega \qquad (4)$$

where:

$I_{out}$ = the outgoing intensity for the surface
$I_{in}$ = an intensity arriving at the surface from
the environment
$E$ = outgoing intensity due to emission by the
surface
$\Theta_{out}$ = the outgoing direction
$\Theta_{in}$ = the incoming direction
$\Omega$ = the sphere of incoming directions
$\theta$ = the angle between the incoming direction
and the surface normal
$d\omega$ = the differential solid angle through which
the incoming intensity arrives
$\rho''$ = the bidirectional reflectance/transmittance
of the surface

The term $\rho''$ in the equation is broken into two
components for the specular and diffuse components
and is given by:

$$\rho''(\Theta_{out}, \Theta_{in}) = k_s\, \rho_s(\Theta_{out}, \Theta_{in}) + k_d\, \rho_d$$

where

$k_s$ = fraction of reflectance that is specular

$k_d$ = fraction of reflectance that is diffuse

$k_s + k_d = 1$.

Hypercube  Implementation  of  the  Model 

A basic assumption of the approach to image synthesis
that follows is that the scene is described in
the standard format for a graphics database (NFF)
suggested by Haines[5], and that the program will include
code that can read a scene file of this type. This implies
that the scene must be composed of a combination of
well defined primitives: spheres, cones, cylinders and
polygons. Only point light sources are included in the
Haines database. Since light emitting surfaces
are to be included, an extra field is added to the object
description.

The image is divided into horizontal bands, numbered
from 0 to M - 1, each of which will be assigned
to a different processor. Let a pixel's coordinates be
called $(I_x, I_y)$. Let the number of rows J mapped onto
each processor be the number of rows in the image,
Xsize, divided by the number of processors, M. To
determine which processor I a given row $I_y$ is mapped
onto, a ring-mapping is used so that $I = I_y \,\mathrm{div}\, J$.
If the scene is projected onto the image using a perspective
projection, each pixel band will represent one
part of the scene. The view volume for the entire image
is thus subdivided into smaller volumes, one for
each band.
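
A short sketch of the row-to-processor mapping follows. The function name and the assumption that the number of processors divides the number of rows evenly are illustrative.

```python
def processor_for_row(i_y, num_rows, num_procs):
    """Return the band (processor) index I = i_y div J for image row i_y."""
    rows_per_band = num_rows // num_procs  # J; assumes an even division
    return i_y // rows_per_band


# Example: a 512-row image split across 8 processors gives J = 64.
bands = [processor_for_row(r, 512, 8) for r in (0, 63, 64, 511)]
```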

Object assignment depends on the dimensions of
the object's bounding slabs. A $d_{min}$ and $d_{max}$ in each
of the x, y and z directions are computed for each object
as the scene file is read. These define the object's
bounding volume, which is stored with other object
data such as shape and surface characteristics. Intersection
of a band's view volume with the bounding
slabs of an object determines that the object has an
effect on that band. The result is that the object's
data is stored on the processor associated with that
band. Determination of which processor an object is
assigned to is described below.

C = eye location (also center of projection).

R = view reference point.

UP = view-up direction.

d = view distance.

N = R - C (normalized).

V = UP - (N · UP) N
(vertical axis in view plane).

U = N × V
(third axis of viewing coordinate system).

f = scalar offset for relative position of $I_x$ and $I_y$.

1. Project each corner, E, of the object's bounding
volume onto the view plane by solving the
set of three parametric equations
$E = C + t\,d\,N + f(I_x)\,U + f(I_y)\,V$
for the three unknowns t, $f(I_x)$, $f(I_y)$.

2. Obtain the vertical image coordinate of the
pixel onto which this corner projects from
the value of $f(I_y)$, which is a linear expression
in $I_y$.

3. Determine the maximum ($I_{y\_max}$) and the
minimum ($I_{y\_min}$) of these vertical image coordinate
values for all 8 corners of the bounding
volume.

4. Assign the object to all processors between
$I_{y\_min} \,\mathrm{div}\, J$ and $I_{y\_max} \,\mathrm{div}\, J$.
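
The four steps can be sketched as follows. This is an illustration under stated assumptions rather than the code used here: it uses an ordinary perspective projection (the vertical view-plane coordinate is d times the V component of E - C divided by its N component), which may differ in detail from the parameterization in step 1. It also assumes every corner lies in front of the eye, and it maps the view-plane coordinate to a row through a hypothetical `half_height` parameter with row 0 at the top. Only the vertical coordinate is needed, so U is not computed.

```python
import itertools
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def scale(a, s):
    return tuple(x * s for x in a)


def normalize(a):
    return scale(a, 1.0 / math.sqrt(dot(a, a)))


def bands_for_object(bmin, bmax, C, R, UP, d, half_height, num_rows, num_procs):
    """Return the processor (band) indices whose rows the bounding box covers."""
    N = normalize(sub(R, C))
    V = normalize(sub(UP, scale(N, dot(N, UP))))
    rows = []
    # Steps 1 and 2: project each of the 8 corners and find its image row.
    for corner in itertools.product(*zip(bmin, bmax)):
        w = sub(corner, C)
        depth = dot(w, N)              # assumed > 0 (corner in front of eye)
        v = d * dot(w, V) / depth      # vertical view-plane coordinate
        row = int((half_height - v) / (2.0 * half_height) * num_rows)
        rows.append(min(max(row, 0), num_rows - 1))
    # Steps 3 and 4: every band between the extreme rows gets the object.
    J = num_rows // num_procs
    return list(range(min(rows) // J, max(rows) // J + 1))


assigned = bands_for_object(
    bmin=(-0.1, 0.5, 2.0), bmax=(0.1, 0.9, 4.0),
    C=(0.0, 0.0, 0.0), R=(0.0, 0.0, 1.0), UP=(0.0, 1.0, 0.0),
    d=1.0, half_height=1.0, num_rows=512, num_procs=8)
```

In this made-up scene the box projects onto rows 140 through 224, so with J = 64 it is stored on processors 2 and 3.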

Each object in the scene is assigned to one or more
pixel bands, and thus to the processor responsible for
that pixel band, depending on the dimensions of its
bounding slabs. This mapping process is shown in
Figure 2.


Figure  2.  Mapping  Process 
When more than one object is assigned to a particular
processor, the tree of bounding volumes within
bounding  volumes  that  was  described  earlier  is  used 
for  storage.  It  is  constructed  in  a  manner  first  sug¬ 
gested  by  Kay  and  Kajiya[8]  that  groups  objects  by 
nearness  as  much  as  possible. 

A  ray  is  traced  in  its  processor  until  no  more  in¬ 
tersections  with  objects  occur  or  a  specified  maximum 
depth of recursion is reached. If the maximum is not
reached  and  the  ray  passes  into  the  space  of  another 
processor,  then  it  is  added  to  a  list  of  rays  to  be  passed 
to  that  processor.  The  hypercube  ray  tracing  algo¬ 
rithm  with  radiosity  effects  included  is  given  below. 

Diffuse Ray Tracing Algorithm

Level = 0
Weight = 1
Form ray with origin at eye and direction through pixel
Intersect ray with everything in this processor I
If intersection occurs
    Find nearest point of intersection P
    Find normal N to surface at P
    Shade(I, Level, Weight, P, N, raydir, hit, color)
    Mark direction cells at P above or below
    For each cell marked as above surface
        Compute direction, cellD
        Create cuberay with:
            origin = P
            direction = cellD
            weight = (N · cellD) ×
            level = maxlevel - 1
            label = this pixel
        Trace(I, cuberay, tcol)
        color = color + tcol × (N · cellD) × Kd
    EndFor
Else
    color = bgcolor
EndIf
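
The inner loop of the algorithm can be illustrated with a small sketch. It is a simplification under stated assumptions: intensities are scalars rather than colors, `trace` is a hypothetical callback standing in for Trace, and cells are classified as above the surface by a positive dot product with the normal. As in the pseudocode, no weighting by the solid angle of each cell is applied.

```python
def gather_diffuse(P, N, cell_dirs, trace, Kd, color=0.0):
    """Add the diffuse contribution of each direction cell above the surface."""
    for cell_d in cell_dirs:
        cos_theta = sum(n * c for n, c in zip(N, cell_d))
        if cos_theta <= 0.0:
            continue  # cell is at or below the surface: no contribution
        # One ray per cell; it is not propagated further because its
        # recursion level is already maxlevel - 1.
        tcol = trace(P, cell_d)
        color += tcol * cos_theta * Kd
    return color


# Hypothetical use: every visible direction returns intensity 2.0.
result = gather_diffuse(
    P=(0.0, 0.0, 0.0), N=(0.0, 0.0, 1.0),
    cell_dirs=[(0.0, 0.0, 1.0), (0.0, 0.0, -1.0), (1.0, 0.0, 0.0)],
    trace=lambda origin, direction: 2.0, Kd=0.5)
```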

The Trace algorithm is modified for the hypercube
from that described by Heckbert[6]. When a ray
is  passed  to  a  neighboring  processor  it  takes  with  it 
information  on  ray  origin,  direction,  weight,  level  and 
image  coordinates  of  the  pixel  to  which  this  ray  con¬ 
tributes. 

Time will be wasted if the processor from which
the ray originated suspends processing until it receives
a response from the processor to which the ray was
passed. If multi-tasking were available, the processor
could commence processing the next pixel while awaiting
a reply. This was the solution chosen by Salmon
and Goldsmith[9] for their hypercube ray tracer, but
they had to implement a system of multi-tasking to
accomplish this. We used an alternative method in
which a list of rays that need more tracing is collected
and then passed as a group to a neighboring processor.
If tracing is completed in some other processor,
that processor needs to know to which pixel this ray
contributes color. Rays not traced to their full depth
of recursion pass the information gathered up to that
point to the next processor as part of the ray description.

In  the  modified  version  of  Shade,  a  color  value  is 
initially  computed  based  on  the  assumption  that  there 
are no shadows. A trial shadow ray is then created
and  tested  first  for  local  shadows  by  intersecting  it 
with  all  objects  assigned  to  the  current  processor.  If 
an  intersection  is  found,  the  color  previously  assigned 
is  removed.  If  there  is  no  intersection  with  any  local 
object,  the  shadow  ray  may  be  passed  to  neighboring 
processors  for  further  testing.  Further  testing  is  not 
necessary  when  the  shadow  ray  reaches  the  light  source 
before  it  leaves  the  space  of  the  current  processor. 

Computing the point of intersection of the shadow
ray with the plane separating the spaces of two different
processors is done as follows. Let P be the
origin of the shadow ray and L its direction, so that
a general point on the line is $Q(t) = P + tL$. $Q(t)$
is in the plane separating processor I from processor
I - 1 if the line joining Q to C (center of projection)
lies entirely in the plane and is therefore perpendicular
to the plane normal. Therefore

$$(P + tL - C) \cdot \left(U \times (dN + K_I V)\right) = 0$$

can be solved to obtain t at
intersection. This value of t is compared to the distance
from the shadow ray origin to the light source
in order to determine whether to pass the shadow ray
to processor I - 1.
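
Solving the plane equation for t is a single division. The sketch below assumes the plane normal is U × (dN + K V), where K is the vertical view-plane offset of the band boundary, and that L is the unnormalized vector from P to the light, so that the light sits at t = 1. Both points are readings of the text rather than confirmed details, and the function name is hypothetical.

```python
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def plane_crossing_t(P, L, C, U, N, V, d, K):
    """Return t where P + t*L meets the band-separating plane, or None."""
    toward_boundary = tuple(d * n + K * v for n, v in zip(N, V))
    normal = cross(U, toward_boundary)
    denom = dot(L, normal)
    if abs(denom) < 1e-12:
        return None  # shadow ray is parallel to the plane
    c_minus_p = tuple(c - p for c, p in zip(C, P))
    return dot(c_minus_p, normal) / denom


# Hypothetical setup: the separating plane is y = 0 through the eye.
t = plane_crossing_t(P=(0.0, -1.0, 5.0), L=(0.0, 4.0, 0.0),
                     C=(0.0, 0.0, 0.0), U=(1.0, 0.0, 0.0),
                     N=(0.0, 0.0, 1.0), V=(0.0, 1.0, 0.0), d=1.0, K=0.0)
must_pass_on = t is not None and 0.0 < t < 1.0
```

Under these assumptions the ray crosses the plane a quarter of the way to the light, so it would be handed to the neighboring processor.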

The  shadow  ray  descriptor  that  is  passed  to  other 
processors  for  testing  will  have  essentially  the  same 
format  as  the  ray  descriptors  passed  for  tracing,  ex¬ 
cept  that  its  direction  is  not  normalized,  and  so  con¬ 
tains  the  essential  information  about  the  distance  of 
the  light  from  the  intersection  point.  If  it  is  found  in 
subsequent  testing  that  an  object  assigned  to  another 
processor  blocks  the  shadow  ray,  then  the  contribution 
of  that  light  source  will  be  subtracted  from  the  final 
color  value  of  the  pixel. 

After  a  processor  completes  its  own  band  of  pixels, 
it  receives  and  processes  the  lists  of  rays  which  en¬ 
tered  its  space  from  neighboring  processors.  Depend¬ 
ing  on  the  depth  of  recursion  for  the  diffusely  reflected 
rays,  further  processing  may  be  required  in  another 
node.  Finally  all  the  pixels,  having  been  assigned  a 
final  value,  are  merged  into  a  single  image  file  on  the 
host  for  display. 
Discussion

Two different approaches to realistic image synthesis
are ray tracing and radiosity. Each method falls
short in its attempt to model the global illumination
present in most environments. A more general
model  includes  the  specular  and  diffuse  reflection  of 
both  these  methods  but  the  combination  requires  pro¬ 
hibitive  computation.  One  way  to  reduce  compute 
time  is  to  use  parallel  processing  but  this  alone  is  not 
enough.  The  approach  suggested  here  is  to  add  a  dif¬ 
fuse  component  to  each  light  ray  while  still  exploiting 
the  time  savings  of  the  progressive  refinement  tech¬ 
niques  of  Cohen  et  al  [3].  In  addition,  a  hypercomputer 
mapping  of  the  algorithm  is  presented. 

The data separation used here is based on the heuristic
that light rays, as well as shadow rays, intersect objects
which are nearby more often than those that are more
distant. Although that heuristic seems sound, its accuracy
is being tested by monitoring the results of processing
a large number of different scenes of various
complexity. Another aspect of ray tracing on a distributed
system is the issue of load balancing. At each
processor,  the  performance  of  our  combined  algorithm 
depends  upon  the  number  of  diffusely  reflected  rays 
present  locally  in  the  scene.  Bounding  slabs  can  be 
used  to  balance  the  number  of  objects  in  each  proces¬ 
sor,  thus  balancing  the  time  needed  for  the  ray  tracing. 
This  approach  is  currently  being  pursued. 
Introduction to Ray Tracing

Ray tracing is presently the method of choice
for generating the most realistic looking synthetic
images. Ray tracing works by tracing the paths of
many rays of light backwards from the viewpoint,
through pixels on an imaginary viewplane, and into
an environment of mathematically defined
3-dimensional solids (objects) called the "scene".
These rays are called primary rays. When a primary
ray's intersection point with the closest object in
the scene is found, a shading model is applied at
that point, and the corresponding pixel's brightness
(color) is calculated. Often, the shading model will
require  that  secondary  rays  be  traced  due  to  reflection 
or  refraction.  Also,  the  shading  model  will  fire  a 
ray  toward  each  light  source  in  the  scene  to  see  if 
the  intersection  point  is  in  that  light  source's 
shadow.  These  rays  are  called  shadow  rays.  Much 
work  has  been  done  by  others  to  optimize  various 
parts  of  the  ray  tracing  algorithm,  such  as  ray-object 
intersection.  [1,  5,  2] 

Scope  and  goals  of  research 

Very  little  work  has  been  done,  however,  in  the 
area  of  parallel  ray  tracing.  Ray  tracing,  until  now, 
was  largely  confined  to  serial  machines.  Thus,  one 
purpose  of  the  Hypercube  Ray  Tracer  is  to 
demonstrate  that  ray  tracing  is  well  suited  to  parallel 
architectures,  and  exhibits  excellent  speedup. 
Another purpose of the Hypercube Ray Tracer project
is  to  select  and  develop  algorithms  and  data 
structures  which  lend  themselves  well  to  the 
parallel,  distributed  computing  environment  of  the 
iPSC/2. 

Parallel  issues  in  ray  tracing 

A  little  thought  will  disclose  that  the 
brightness  of  each  pixel  on  the  viewplane  is 
completely independent of its neighbors. This is
not to say there is no correlation among pixels;
clearly there is, but the calculations
performed for pixels are independent. This fine
grained parallelism makes ray tracing suitable for
coarse, medium, and fine grained parallel machines.

Furthermore, since all pixel calculations are
independent of one another, there need be no
communication between processing elements during
the rendering process itself. This observation
makes ray tracing equally attractive to both shared-
and distributed-memory architectures from an
interprocessor communications standpoint. One can
envision ray tracing schemes which do perform
interprocessor communications for one reason or
another. Some of these schemes will be discussed
later in this paper.

There is one key data structure that drives the
ray tracing algorithm: the database describing the
objects to be rendered. This so called "object
database" can become very large if the number of
objects to be rendered becomes large. Since objects
can be reflective and refractive, secondary rays and
shadow rays can intersect any object in the scene.
For this reason, each processor must have access to
the entire object database when rendering a given
scene. Though this is not usually a problem for
shared memory parallel processors, it can be for
distributed memory machines because of the
generally limited memory available to individual
processors. This issue, too, will be discussed later
in this paper.

Problem  decomposition  for  the  iPSC/2 

The ray tracing algorithm itself may be
decomposed in many ways onto the hypercube
topology of the iPSC/2. One might opt for the
simplest approach of placing a complete ray tracer
on each node. Each node then ray traces a fixed
portion of the pixels on the viewplane. This
method has the distinct advantage of simplicity. A
little thought will reveal that some portions of an
image may take longer to calculate than others.
This will be the case for pixels whose primary rays
intersect complex, mirrored, or refractive objects.
Thus, one node may take much longer to ray trace
its fixed portion than other nodes. This leads to
poor node utilization and poor load balance.

In a different scheme one might divide the
processing nodes into "intersection processors" and
"shading processors", with the former doing all
ray-object intersections and the latter performing all
shading calculations on intersection points found by
the intersection processors [3]. Since all iPSC/2
nodes are the same, there is no reason to believe
that some nodes could perform one function better
than others. Furthermore, the intersection process
is far more time consuming than the shading
process, so any time saved in shading calculations
would make very little difference in overall
performance. This method has the further
disadvantage of requiring a considerable amount of
internode communication between intersection
processors and shading processors. Efficiency
would suffer because of this communication time.

The Hypercube Ray Tracer uses a variation on
the first alternative presented above: a complete ray
tracer is placed on each node. It solves the load
balance problem by dividing the image in such a
way that all nodes are likely to share in the more
difficult portions of the image. We start at the top
raster of the image and move down, assigning
rasters to successive nodes. When all nodes have
been assigned a raster, assignment restarts with
the first node and continues until all rasters have
been assigned. This is called the "comb"
distribution because the rasters associated with a
given node look like the teeth of a comb.
Experiments show this image decomposition to
produce very low load imbalances in images of
medium to large size with an arbitrary number of
objects in the scene.
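
The comb distribution amounts to a round-robin assignment of rasters to nodes. A minimal sketch, with a hypothetical function name and rasters numbered from 0 at the top of the image:

```python
def comb_rasters(node, num_nodes, num_rasters):
    """Return the raster indices assigned to `node` under the comb distribution."""
    return list(range(node, num_rasters, num_nodes))


# Example: 10 rasters dealt out to 4 nodes.
teeth = [comb_rasters(n, 4, 10) for n in range(4)]
```

Here node 0 receives rasters 0, 4 and 8, node 1 receives 1, 5 and 9, and so on, so neighboring rasters (which tend to have similar cost) land on different nodes.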

Implementation  issues 

There are a number of other issues which need
attention in the parallel environment of the iPSC/2.
These are the structure and content of the object
database, the ray-object intersection method, and
antialiasing methods. Several choices are available
for each topic, and are discussed in turn.

First, we turn our attention to the structure of
the object database and ray-object intersection
methods. These subjects are closely tied together
and should be considered simultaneously. The way
in which the database is organized and queried is of
critical importance to the speed of ray-object
intersection. And since ray-object intersection is
the major consumer of time in the ray tracing
process, database organization is equally critical in
overall rendering speed.

There are two general classes of ray-object
intersection acceleration methods, and each dictates a
database structure. The two classes are object
subdivision [5] and space subdivision [1, 2]. Object
subdivision methods usually organize the database
into a hierarchy while space subdivision methods
divide the database into a number of smaller
databases. The object subdivision method of [5] has
the advantage of keeping the database in one piece.
Since this one-piece structure will aid in later
database distribution, we chose it over competing
object subdivision techniques.

Each pixel, and thus each node, must have
access to the entire object database. This does not
mean that each node must have a copy of the
database, just access to it. Indeed, if each node
stores a copy of the database, a tremendous
aggregate amount of memory will be wasted storing
these multiple copies. If, on the other hand, a node
must frequently request objects from other
nodes, then a tremendous amount of time will be
wasted waiting for objects to arrive. In the
distributed memory environment of the iPSC/2,
this tradeoff has serious implications: speed must
be traded for larger database sizes. In this first
implementation of the Hypercube Ray Tracer, speed
of execution and ease of implementation are more
important than a large database size. This decision
is tempered with one proviso, however. All data
structures and algorithms must be suitable for use
with the distributed database concept so that future
expansion will be simplified as much as possible.

Finally, we shall consider several antialiasing
techniques. Aliasing in synthetic images manifests
itself in several forms. First, and best known of all
aliasing modes, is the "jaggies": those jagged
edges at the boundary of an object's image or shadow.
Not so well known are the beat patterns and moiré
patterns that appear when an area of high spatial
frequency is sampled by the ray tracing algorithm.
All of these problems are caused by sampling on a
regular pixel grid. Antialiasing is very important to
realism because it smooths out jagged edges and
other artifacts in the image.

Again, there are several methods to choose
from. Statistical pixel subsampling techniques are
most attractive in terms of ease of implementation
[6]. These methods combat aliasing effects by
sampling a pixel's area more than once in a
nonuniform pattern. Once several samples are taken
from the pixel, a statistical test such as a variance
threshold is applied to the samples to determine if
more samples are needed. When enough samples
are accumulated, they are averaged in some way to
produce a representative brightness for the pixel in
question. Most statistical subsampling techniques
require knowledge of brightness levels in only a one
pixel neighborhood. This is attractive in the
distributed environment of the iPSC/2 because
rendered image data is distributed among nodes.
Thus, any methods which require the brightness of a
neighboring pixel may have to get it from another
node, a relatively slow process on the iPSC/2.
Here again, we choose speed and ease of
implementation as more important to the
Hypercube Ray Tracer.

Other  antialiasing  methods  rely  on  filtering, 
adaptive  pixel  subsampling,  or  a  combination  of 
both.  [4,  8]  All  of  these  methods  present  problems 
because  they  assume  the  immediate  availability  of 
any  pixel's  brightness.  As  stated  above,  this  is  not 
always  the  case  in  a  distributed  computing 
environment.  These  methods  are  rejected  for  this 
reason in this first implementation of the
Hypercube  Ray  Tracer. 

One  feature  of  the  Hypercube  Ray  Tracer  that 
has  not  received  a  lot  of  research  is  the  process  of 
constructive  solid  geometry  (CSG).  [9]  This  is  the 
process  of  defining  new  object  types  from 
previously  existing  ones  by  means  of  boolean 
operations.  For  example,  a  box  with  a  hole  in  it 
can  be  constructed  by  subtracting  a  cylinder  from 
the box (Box and not(Cylinder)). The Hypercube
Ray  Tracer  includes  an  original  ray-CSG  object 
intersection  scheme  which  we  have  developed  on 
top  of  Kay’s  intersection  scheme.  It  performs  the 
intersection  in  linear  time  with  the  number  of 
objects  defining  the  CSG  object. 

Late  in  the  Hypercube  Ray  Tracer's 
construction,  a  decision  had  to  be  made  between 
implementing  CSG  or  distributing  the  object 
database.  After  much  thought,  we  decided  that  CSG 
would  have  a  profound  impact  on  the  object 
database  organization.  It  then  made  sense  to 
implement  CSG  first.  This  way,  any  changes  in 
the database organization could be easily taken into
account  when  the  time  came  to  distribute  it. 

Results 

All  test  images  were  calculated  at  512  x  512 
resolution  on  32  scalar-enhanced  nodes  with  no 
antialiasing. Figures 1 and 2 show the amount of
time  taken  to  render  each  test  image  on  different 
numbers  of  nodes.  Tables  1  and  2  show  the  same 
data  as  well  as  speedup  and  efficiency  for  the  same 
configurations.  Here,  efficiency  is  defined  as  the 
speedup  over  the  number  of  nodes  used  for  a  given 
problem. 
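
As a small worked example of these measures, the sketch below assumes the usual definition of speedup as the one-node time divided by the n-node time, which the text does not spell out. The timings are invented for illustration and are not taken from Tables 1 and 2.

```python
def speedup_and_efficiency(t_one_node, t_n_nodes, n_nodes):
    """Return (speedup, efficiency) for a run on n_nodes nodes."""
    speedup = t_one_node / t_n_nodes
    efficiency = speedup / n_nodes  # speedup over the number of nodes used
    return speedup, efficiency


# Invented timings: 320 s on one node, 12.5 s on 32 nodes.
s, e = speedup_and_efficiency(320.0, 12.5, 32)
```

With these made-up numbers the speedup is 25.6 and the efficiency is 0.8.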

The  Hypercube  Ray  Tracer's  antialiasing 
strategy  greatly  increases  the  visual  realism  of  its 
rendered  images.  Jagged  edges  and  razor  sharp  lines 
are no longer a problem. A rather substantial time
penalty is paid for antialiasing, though. The final
method used casts eight uniformly random rays
through each pixel and calculates the variance of the
average  intensity.  If  the  variance  is  above  a 
predetermined  threshold,  groups  of  four  additional 
rays  are  traced  until  the  variance  drops  below  the 
threshold  or  until  32  rays  have  been  traced, 
whichever  comes  first.  Thus,  each  pixel  is  8  to  32 
times  more  expensive  to  compute  than  without 
antialiasing. 
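
The sampling rule just described can be sketched as follows. This is a minimal illustration, not the Hypercube Ray Tracer's code: the callback trace, the function name sample_pixel, and the choice of sample variance divided by sample count as "the variance of the average intensity" are all assumptions.

```python
import random
from statistics import mean, variance


def sample_pixel(trace, px, py, threshold, rng,
                 first=8, extra=4, max_rays=32):
    """Adaptively supersample one pixel.

    trace(x, y) -> intensity is a hypothetical callback that traces
    one ray through viewplane position (x, y).
    """
    def jittered():
        # One ray through a uniformly random point inside the pixel.
        return trace(px + rng.random(), py + rng.random())

    samples = [jittered() for _ in range(first)]
    # Variance of the average intensity, assumed here to be the
    # sample variance divided by the number of samples.
    while (variance(samples) / len(samples) > threshold
           and len(samples) < max_rays):
        samples.extend(jittered() for _ in range(extra))
    return mean(samples), len(samples)


rng = random.Random(0)
flat = sample_pixel(lambda x, y: 0.5, 10, 10, 1e-4, rng)
```

A pixel of uniform intensity stops after the initial eight rays, while a pixel straddling a sharp edge keeps drawing groups of four until the 32-ray cap.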

Future  Work 

Many  tradeoffs  have  been  made  in  the  design  of 
the  Hypercube  Ray  Tracer.  The  most  notable  of 
these  is  the  choice  to  maintain  a  complete  copy  of 
the  object  database  on  each  node  of  the  iPSC/2. 
This one decision greatly sped program development
but severely limited the maximum number of
objects. New techniques, which are already under
development, will allow the database to be
distributed across the nodes with only a modest
performance penalty.

The rather recent development of "distributed
ray tracing" [6] adds new and exciting effects to ray
traced images such as penumbrae, distributed light
sources, and frosted glass. Note that distributed ray
tracing should not be confused with the object
database distribution proposed above. Rather, it
refers to the small random perturbations of rays used
to achieve the aforementioned effects. Distributed
ray tracing, too, is well suited to the distributed
computing environment for the same reason simple
ray tracing is: the ray calculations are independent of
one another.

Relatively  simple  object  models  are  used  in  the 
Hypercube  Ray  Tracer.  They  include  spheres, 
cylinders,  cubes,  polygonal  prisms,  and  convex 
superquadric ellipsoids. New shapes such as bicubic
patches,  superquadric  toroids,  and  swept  cubic 
curves  would  greatly  add  to  the  usability  of  the 
Hypercube  Ray  Tracer. 

Summary 

Ray  tracing  is  a  complex  rendering  technique 
which  has,  until  now,  been  almost  exclusively 
confined  to  serial  computers.  The  Hypercube  Ray 
Tracer  efficiently  takes  the  rendering  technique  into 
the  parallel  domain  with  considerable  time  savings 
and  very  nearly  linear  speedup.  Ray  tracing  is  thus 
shown  to  be  well  suited  to  distributed  memory 
parallel  architectures. 

Leading object intersection algorithms and data
structures are chosen and modified to be efficient and
expandable in the parallel environment.
Antialiasing techniques are discussed and applied in
parallel to the problem. The ramifications of object
database duplication are discussed at length and
considerations are made for future distribution across
the iPSC/2.

The problem of load balancing is discussed, and
a static load balancing scheme based on the "comb"
image decomposition is offered as a primary
solution. Results are presented to confirm that the
"comb" decomposition is a very effective one with
the current repertoire of object database organization
and object intersection algorithms.

Table 1: Performance Data for Self Portrait Image
(1381 objects)

Table 2: Performance Data for Superquadric Grid Image
(93 objects)

Abstract

A medium-grained, distributed-memory parallel
computer is used as a platform from which to research a
specific issue in accelerating the ray-tracing process.
The work presented here deals specifically with the
problem of distributing large object databases over the
computing nodes of the Intel iPSC/2. An efficient
object database decomposition method is presented
which behaves like a fully-associative cache.
Ramifications of the distributed object database forced
the development of an interruptible ray-tracing loop, a
ray scheduler, and a dynamic viewplane decomposition.
Performance metrics, such as object database hit ratio,
load balance, and efficiency, are presented.

Introduction 

Ray-tracing is a realistic image synthesis technique
which produces superior quality images by consuming
superior amounts of computer time! Although much
work has been done to speed the ray-tracing process, it
still remains one of the most expensive image synthesis
techniques in terms of computer time used [1, 3, 4, 5,
8, 9].

The ray-tracing process is deceptively simple.
Parallelizing the ray-tracing process on a
distributed-memory parallel computer also seems simple
at first glance. It has, however, several pitfalls which
are well hidden. One such pitfall is the problem of
distributing the object database (ODB) across the nodes
of the computer without seriously affecting the
performance of the already power-hungry algorithm. Both
the object database decomposition and the problems
associated with it will be discussed in subsequent
sections.

Before launching into this, however, let us first
briefly review the ray tracing algorithm. The ray-tracing
milieu consists of an observer, a viewplane, and
a set of objects called the scene. The observer is a point
in space from whose perspective the scene is to be
rendered. The viewplane is an imaginary rectangle
through which the observer views the scene. The
viewplane is divided into a grid of pixels. It is the task
of the ray-tracing procedure to find the light intensity
present at each pixel. The scene is composed of a
(potentially large) number of three-dimensional
geometric figures called primitives. Primitives can be
as simple as a sphere or cube, or as complex as a fractal
mountainside. Any light reaching the observer through
a given pixel must have come from the direction along
a ray from the observer to the pixel in question. If one
traces backward along this line of propagation into the
scene, the surface from which the light was scattered can
be discovered.

Since the physical properties of the surface are
known, we can model the way it scatters light. The
mathematical model used is called a shading model.
Although the ray may intersect several objects in the
scene, only the intersection point closest to the observer
is relevant. Rays cast from the observer through the
pixels are called primary rays. The brightness of each
pixel on the viewplane is computed completely
independently of its neighbors. Clearly, the intensities
of adjacent pixels are highly correlated, but the
calculations themselves are independent. Pixel
independence gives ray tracing the fine-grained
parallelism that makes it suitable for implementation on
fine-, medium-, and coarse-grained parallel computers.

The intensity correlation between adjacent pixels
makes a distributed ODB feasible in an indirect way. If
primary rays are traced through the viewplane in a
spatially coherent manner, then the ODB references
generated will have a high degree of temporal and spatial
locality within the ODB. This observation follows
naturally from the functioning of the ray-tracing
algorithm and the Kay ray-ODB intersection process [9].
This high degree of locality is just the property that
makes an ODB cache feasible.

Though medium-grained distributed-memory
parallel computers do not have sufficient memory per
computing node to store very large ODBs, all is not
lost. The database can be broken up, and the pieces
stored on different nodes. Then, during the ray tracing
process, parts of the database may be shuttled between
nodes as needed. Each computing node does not
necessarily need access to the whole ODB for every ray
traced, but there is currently no easy way to predict
which parts it will need access to. If we could easily
predict which parts of the ODB certain rays needed
access to, then the ODB distribution process could be
greatly simplified. As it is, we can only guess at which
pieces are needed based on previous accesses. This sort
of guessing is exactly the function that a cache
performs.

ODB  Decomposition 

In the Hypercube Ray Tracer, the ODB is organized
into a hierarchy. Primitives are present only at the leaf
level. The body of the hierarchy is composed of Hnodes
whose sole purpose is to impose an efficient
geometrical structure onto the list of primitives which
facilitates an efficient ray-primitive intersection process
[9]. The structure is that described in [9], and the Hnode
structure is generated according to [5].

We  have  chosen  a  dynamic  ODB  decomposition 
where  primitives,  rather  than  rays,  are  transmitted 
between  nodes.  Initially,  the  ODB  is  split  evenly  across 
the  nodes,  just  as  with  Goldsmith's  method  [6].  The 
similarity  ends  here.  The  computing  node  to  which  a 
primitive  is  initially  assigned  is  called  its  home  node, 
and  that  node  will  always  store  a  copy  of  the  primitive. 
Once  a  computing  node  discovers  that  it  does  not  have  a 
part  of  the  ODB  it  needs,  that  primitive  is  requested 
from its home node. This is called an ODB miss.

Primitives  are  retained  on  the  nodes  until  their 
memory  is  exhausted  and  space  is  needed  for  another 
primitive.  In  this  way,  a  node  stores  its  share  of  the 
ODB  plus  some  number  of  transitory  primitives. 
Transitory  primitives  are  discarded  as  needed  to 
accommodate new transitory primitives needed in the
intersection  process.  The  set  of  transitory  primitives  is 
the  cache.  The  least-recently-used  (LRU)  cache 
replacement  method  is  used  to  select  which  transitory 
primitives  are  no  longer  needed.  Since  only  the  least 
recently  used  transitory  primitives  are  thrown  away,  the 
more  heavily  used  ones  remain  on  the  node.  This 
greatly  reduces  the  ODB  miss  rate,  and  hence  message 
traffic  between  nodes.  The  ideal  condition  of  having  the 
whole  ODB  resident  on  each  node  is  thus  more  closely 
approached. 

This  method  of  ODB  decomposition  has  the  ability 
to  distribute  a  very  large  number  of  primitives  across  a 
number  of  computing  nodes.  Moreover,  the  ODB 
distribution  is  automatically  adjusted  to  place  the  proper 
primitives  just  where  they  are  needed.  Much  better 
performance  is  realized  with  this  strategy  than  with  a 
static  ODB  decomposition,  and  its  fully-associative 
nature  gives  it  an  advantage  over  a  direct  mapped 
caching  scheme  [7]. 

Swapping  Policy 

Sending  messages  from  one  hypercube  node  to 
another  is  a  costly  process.  There  is  a  heavy  time 
penalty  to  set  up  a  message  route  plus  a  modest  penalty 
for  each  byte  transferred.  In  order  to  defray  the  high 
startup  cost,  long  messages  are  preferred  over  short 
ones.  A  single  primitive,  the  result  of  an  ODB  miss, 
would  make  a  very  short  message.  It  is  desirable  to 
send  several  primitives  at  once  when  swapping  is 
required.  But  which  primitives  should  be  picked?  It 
would  be  most  helpful  to  send  additional  primitives 
which  are  likely  to  be  needed  in  the  future.  Indeed,  the 
Kay  ray-primitive  intersection  algorithm  tests  all  of  the 
children  of  a  given  Hnode  at  once.  Therefore,  it  makes 
sense  to  send  all  siblings  of  the  requested  primitive  as 
they will all be tested. Thus, we move from the
concept of swapping individual primitives to swapping
all primitives associated with a given Hnode. This is
analogous to the notion of line size in a conventional
cache.
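
The argument for batching can be made concrete with a linear cost model: a fixed startup charge plus a charge per byte. The sketch below is only a worked illustration; the function names and all timing figures are hypothetical and are not measurements of the iPSC/2.

```python
def message_time(n_bytes, startup, per_byte):
    """Linear cost model: fixed route setup plus a per-byte charge."""
    return startup + n_bytes * per_byte


def separate_vs_batched(n_prims, prim_bytes, startup, per_byte):
    """Time to send sibling primitives one by one versus in one message."""
    separate = n_prims * message_time(prim_bytes, startup, per_byte)
    batched = message_time(n_prims * prim_bytes, startup, per_byte)
    return separate, batched


# Hypothetical figures: 8 siblings of 100 bytes, 500 us startup, 1 us per byte.
separate, batched = separate_vs_batched(8, 100, 500.0, 1.0)
```

Under these assumed figures the single batched message pays the startup cost once instead of eight times, which is where nearly all of the saving comes from.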

A number of special considerations in the hierarchy
are required to support this ODB decomposition and
swapping scheme. The hierarchy is composed of two
basic entities: the group of Hnodes which comprise the
interior of the hierarchy, and the primitives. The
Hnodes are only responsible for about 14.3% of the
total number of nodes in the ODB [2]. Further, an Hnode
takes only one third of the memory space needed to
store a primitive. Therefore, the Hnode infrastructure
is effectively responsible for only about 5% of the size
of the ODB. Thus, each node can easily store the ODB
infrastructure and just swap groups of primitives. This
also allows the Kay algorithm to go all the way to the
leaf level before an ODB miss is possible. This
requires each Hnode to contain information about
whether or not its child primitives are resident.
Hnodes  must  also  keep  track  of  the  LRU  reference 
word  for  cache  replacement  purposes.  The  Hnode 
structure  contains  the  following  information. 

TABLE 1

Fields in the Hnode Data Structure

Field and Description

1. Pointers to sub-hierarchies (max 8).

2. A unique ID number.

3. A bounding volume enclosing all sub-hierarchies.

4. LRU reference word.

5. A flag which is true if this Hnode's child primitives
are not resident.

Note  that  not  all  Hnodes  have  child  primitives. 
Some  Hnodes  will  have  only  other  Hnodes  as  children. 
These  interior  Hnodes  are  unswappable,  and  do  not  take 
part  in  the  ODB  distribution  process.  The  balance  of 
the  Hnodes  are  called  swappable  Hnodes,  and  do  take 
part  in  the  distribution  process.  Stated  another  way,  an 
Hnode  is  swappable  if  and  only  if  at  least  one  of  its 
children  is  a  primitive. 
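
The Hnode fields and the swappable rule might be modeled as in the sketch below. The class and attribute names are hypothetical, Primitive is a stand-in for real geometry, and the bounding volume is simplified to an axis-aligned box for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Primitive:
    name: str  # stand-in for real geometry such as a sphere or cube


@dataclass
class Hnode:
    node_id: int                                  # unique ID number
    bounds: Tuple[Tuple[float, float, float],
                  Tuple[float, float, float]]     # box enclosing all children
    children: List[Union["Hnode", Primitive]] = field(default_factory=list)
    lru_word: int = 0                             # LRU reference word
    not_resident: bool = False                    # child primitives swapped out?

    MAX_CHILDREN = 8  # class constant, not a dataclass field (no annotation)

    def add_child(self, child):
        if len(self.children) >= self.MAX_CHILDREN:
            raise ValueError("an Hnode holds at most 8 sub-hierarchies")
        self.children.append(child)

    @property
    def swappable(self):
        # Swappable if and only if at least one child is a primitive.
        return any(isinstance(c, Primitive) for c in self.children)
```

An Hnode whose children are all Hnodes reports swappable as False and so would stay resident on every computing node.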

As  stated  above,  a  certain  portion  of  the  ODB  must 
remain  resident  on  each  computing  node.  Rather  than 
thinking  of  this  portion  as  a  set  of  primitives,  we  shall 
think  of  it  as  a  set  of  swappable  Hnodes.  The 
swappable  Hnodes  are  divided  evenly  among  the 
processors  rather  than  the  primitives  directly.  In  this 
way,  the  child  primitives  of  a  swappable  Hnode  are 
never split between two computing nodes. In a scene
with  a  large  number  of  primitives,  the  unevenness  in 
the  distribution  of  primitives  caused  by  this  method  is 
negligible.  Hnodes  are  assigned  to  computing  nodes  in 
a  round-robin  fashion  as  their  ID  numbers  order  them. 
This  randomizes  the  ODB's  initial  distribution,  and 
helps  to  even  out  the  burden  of  ODB  requests. 
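
A minimal sketch of the round-robin assignment, assuming that home nodes are numbered from zero and that only swappable Hnode IDs take part; the function name and the sample IDs are hypothetical.

```python
def assign_home_nodes(swappable_ids, n_nodes):
    """Map each swappable Hnode ID to a home computing node.

    IDs are dealt out in ascending order, one per node in turn.
    """
    return {hid: i % n_nodes for i, hid in enumerate(sorted(swappable_ids))}


# Hypothetical IDs; gaps arise because unswappable Hnodes are skipped.
homes = assign_home_nodes([7, 3, 12, 9, 21, 4], n_nodes=4)
```

Any node can recompute this mapping locally, so locating a primitive's home node requires no communication.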

Primitives are swapped in and out as groups. The
LRU replacement algorithm targets the Hnode whose
LRU reference word is smallest for replacement. All
child primitives of the target Hnode are freed, and the
Hnode is marked as swapped. The targeting and freeing
operations are repeated until enough space is available
for the incoming primitives.
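
The replacement loop can be sketched as follows, under the assumptions that memory is tracked as a simple free-space counter and that a larger LRU reference word means a more recent reference. The names Group and make_room are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Group:
    """Child primitives of one swappable Hnode (hypothetical record)."""
    hnode_id: int
    size: int            # memory held by the child primitives
    lru_word: int        # larger means more recently referenced
    swapped: bool = False


def make_room(transitory, free_space, needed):
    """Free least-recently-used groups until `needed` space is available.

    `transitory` holds only groups whose home is another node, since
    a node always keeps the primitives it is home to.
    Returns the new free space and the IDs of the evicted groups.
    """
    evicted = []
    resident = [g for g in transitory if not g.swapped]
    while free_space < needed and resident:
        victim = min(resident, key=lambda g: g.lru_word)
        resident.remove(victim)
        victim.swapped = True          # mark the Hnode as swapped
        free_space += victim.size      # its child primitives are freed
        evicted.append(victim.hnode_id)
    return free_space, evicted


cache = [Group(5, 40, lru_word=17), Group(8, 30, lru_word=3),
         Group(11, 50, lru_word=9)]
free_space, evicted = make_room(cache, free_space=10, needed=60)
```

In this example two groups are evicted, oldest first, and the most recently referenced group stays resident.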

The  Ray  Tracing  Loop 

Now that ODB distribution has been addressed, we
must address the problems this causes in the ray
tracing loop. Since parts of the ODB can be missing
on each node, the Kay intersection algorithm may fail.
When it does fail, a request for the missing primitive
must be formulated and sent to the primitive's home
node. (A primitive's home node is based on the unique
ID number assigned to the Hnode parent of the
primitive.) Once the requested primitive is received and
inserted into the ODB, we must restart the intersection
algorithm from where it stopped. This prevents
thrashing, but requires paying a considerable price in
terms of program complexity.

Once an ODB miss occurs, what happens while the
node is waiting for primitives from another node? That
node may do one of two things: wait for the primitive
to be sent from another node, or work on another ray.
Considering the cost of sending a message to another
node, and waiting for it to reply, waiting is out of the
question. Therefore, the node must occupy its time
doing something constructive; processing another ray is
an ideal choice. Consequently, the entire state of the ray
tracing process must be saved when an ODB miss
occurs. The ideal place to save this information is in
the same data structure as the offending ray so that the
ray's entire context is neatly localized.

Now that intersection may be stopped and restarted,
we must consider another step in the ray tracing
process, namely the shading step. The shading model
casts shadow rays every time it is evaluated, and
optionally casts reflected and refracted rays. These
secondary rays must also be ray-traced. Since they may
also cause ODB misses, the shading model evaluation
must be made interruptible, too! To complicate matters
further, the shading model may be interrupted in no
fewer than three different locations: once for each light
source when casting a shadow ray, once for the reflected
ray, and once for the refracted ray! Now, the ray tracing
loop has become a very complex choreography of
interruptible states, spawning of sub-rays, and
resumption of control. The following finite-state
automaton (FSA) is our solution to the control problem
(Figure 2).

In Table 2, we see that rays are divided into
a number of different types: primary rays, secondary
rays, and shadow rays. The only difference between the
types is the way in which the shading model operates.
For primary rays, the full shading model is evaluated,
and the result is stored at the appropriate pixel
coordinates in the local frame buffer. Secondary rays
execute the full shading model, but pass their intensity
to their parent ray rather than the frame buffer. Shadow
rays need not be shaded at all, only intersected with the
ODB.

TABLE 2

State and Description

1. Ready to intersect - A ray is set up and ready to be
intersected against the ODB.

2. Pending object from another node - A ray has
suffered an ODB miss, and is waiting for the required
primitives to be sent from elsewhere.

3. Shadow ray setup - Shadow rays are set up and
spawned from this state. A ray to a different light
source is spawned each time this state is entered.

4. Pending on shadow ray - Once a shadow ray has
been spawned, the parent ray must wait for it to
complete.

5. Process shadow ray - The results of the shadow ray
intersection are stored and control is passed back to
the "shadow ray setup" state to cast more shadow
rays.

6. Shading - This step in the shading model performs
all operations that depend only on the results of the
shadow rays, i.e. ambient, diffuse, and specular
components.

7. Reflective shading - If the surface of the primitive
in question is reflective, this state spawns a
reflected ray.

8. Pending reflected ray - Control comes here to wait
on a reflected ray to be traced.

9. Process reflected ray - The contribution of the
reflected ray is added into the overall shading in this
state.

10. Transmissive shading - If the surface of the
primitive in question is transmissive, this state
spawns a refracted ray.

11. Pending transmitted ray - Control comes here to
wait on a refracted ray to be traced.

12. Process transmitted ray - The contribution of the
refracted ray is added into the overall shading in this
state.

13. Forward results - Ray results are ready to be passed
on. Depending on the ray type, the results are either
put in the local frame buffer (primary ray), or
forwarded to the parent ray (shadow or secondary
ray).

TABLE 3

1. ODB miss - This event is posted by the "ready to
intersect" state when an ODB miss occurs.

2. Object received - This event is posted when
primitives arrive from another node as the result of
an ODB miss.

3. Spawn - Posted whenever a state has to spawn a
subray. This happens for shadow rays, reflected
rays, and refracted rays.

4. Complete - This event is posted to a parent ray
when a child ray has completed.

5. Done - Posted by state action functions, this event
signals that the state completed successfully, and
the ray is ready to move on to the next state.

6. Backtrack - If a shadow ray intersects a transparent
object, it is not necessarily occluded. Used to
restart the intersection process to find the next
intersection point along the shadow ray.

7. Missed - If the intersection process misses all
objects in the ODB, this event shortcuts straight to
the "Forward Results" state.


Transitions  between  ray  states  are  caused  by  events 
posted to a specific ray during the ray-tracing loop
(Table  3).  These  events  are  based  on  the  result  of  the 
action  associated  with  each  state.  Actions  perform  the 
various  steps  in  the  ray  tracing  process.  For  example, 
the  action  associated  with  the  first  state  in  the  FSA  is 
to  try  to  intersect  the  ray  with  the  ODB.  If  the 
intersection  fails,  the  action  function  posts  an  ODB 
miss  event  for  the  ray,  and  terminates.  The  result  of 
this  event  is  to  place  the  ray  in  the  Pending  Object  from 
Another  Node  state.  If  the  intersection  succeeds,  the 
action  function  posts  a  Done  event  and  terminates.  The 
result  of  this  event  is  to  place  the  ray  into  the  Shadow 
Ray  Setup  state.  The  concept  of  state  driven  ray  tracing 
complicates  the  classical  ray  tracing  loop,  but  divides  it 
into  an  interruptible  series  of  modular  operations. 
Although  some  of  the  states  presented  above  could  be 
merged,  they  are  left  separate  for  clarity. 
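
The event-driven transitions described in this paragraph can be sketched as a lookup table. Only a few states and events are modeled here. The transition on Object Received and the behavior for unlisted pairs are assumptions, and the identifier names are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    READY_TO_INTERSECT = auto()
    PENDING_OBJECT = auto()
    SHADOW_RAY_SETUP = auto()
    FORWARD_RESULTS = auto()


class Event(Enum):
    ODB_MISS = auto()
    OBJECT_RECEIVED = auto()
    DONE = auto()
    MISSED = auto()


TRANSITIONS = {
    (State.READY_TO_INTERSECT, Event.ODB_MISS): State.PENDING_OBJECT,
    (State.READY_TO_INTERSECT, Event.DONE): State.SHADOW_RAY_SETUP,
    (State.READY_TO_INTERSECT, Event.MISSED): State.FORWARD_RESULTS,
    # Assumed: once the primitives arrive, intersection is retried.
    (State.PENDING_OBJECT, Event.OBJECT_RECEIVED): State.READY_TO_INTERSECT,
}


def next_state(state, event):
    """Look up the successor state; unlisted pairs leave the ray where it is."""
    return TRANSITIONS.get((state, event), state)
```

Keeping the transitions in a table separates control flow from the action functions, which matches the modular, interruptible structure the text describes.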

As stated earlier, multiple rays are used so time is
not wasted waiting for ODB misses to be resolved.
Since there can be a number of pending rays equal to a
preset maximum recursion depth, a way is needed to
keep track of all of these rays. A way is also required to
keep track of events destined for a particular ray. The
solution used here is a ray queue to keep the rays, and
an event queue to keep track of the events.

As  new  rays  are  created,  they  are  pushed  onto  the 
ray  queue  to  begin  their  journey  through  the  states  that 
will  ray  trace  them.  Similarly,  ray-event  tuples  are 
pushed  onto  the  event  queue  for  evaluation.  A  scheduler 
is  responsible  for  driving  the  FSA  from  the  rays  and 
events.  The  scheduler  sits  at  the  top  of  the  control 
structure  for  the  new  node  ray-tracing  loop.  Algorithm  1 
gives  the  pseudocode  representation  for  the  node 
program. One will notice the striking resemblance
between Algorithm 1 and any standard round-robin task
scheduler. In the ray tracer's case, the analog for a
process is the ray.

    Download ODB from the host

    While there are pixels to ray trace or ray
    queue not empty

        /* Service queued events. */
        While event queue not empty
            Pop event queue
            Determine next state of ray
        Endwhile

        /* Service primitive requests from other nodes. */
        If ODB request from other node
            Pack requested portion
            Send it to requesting node
        Endif

        /* ODB request reply */
        If there is ODB request reply
            Receive the message
            Unpack it into local ODB
            Notify all rays pending
        Endif

        /* Add a new primary ray */
        If there is room on ray queue
            Construct a new primary ray
            Push it onto the ray queue
        Endif

        /* Execute a ray's state function. */
        Pop a ray from ray queue
        Execute its state function
    Endwhile

    Send local frame buffer to host

Algorithm 1: Scheduler for State Driven Node Program

Image  Decomposition 

When  the  leap  is  made  from  a  duplicated  ODB  to  a 
distributed  ODB  (DODB),  many  things  change.  The 
structure  of  the  ODB  changes  from  a  fully  intact 
hierarchy  to  a  hierarchy  missing  most  of  its  leaves. 
The  hierarchy  nodes  themselves  become  more  complex. 
Ray-ODB intersection becomes an interruptible,
re-entrant process rather than classical straight-line code.
Even the ray tracing loop itself changes from a
regimented and easy-to-understand loop into a complex
scheduler driving a thirteen-state FSA.

After such a drastic change to the basic ray tracing
loop, the suitability of the standard image
decomposition needs to be reassessed. Using the
"comb" decomposition, the image plane is divided into
equal-area pieces, and each piece assigned to a
computing node for ray-tracing [2]. Within the assigned
area, which is usually rectangular, pixels are ray-traced
from the upper left corner toward the bottom right
corner by rows. There is one basic problem associated
with this decomposition: that of locality. By
choosing a different image decomposition, we may
reduce the number of ODB misses considerably. If the
DODB is to perform well, then the rays tested against it
should be fairly localized with respect to their positions
and directions within the scene. This locality of
reference keeps the number of ODB misses down, and
the performance up. If widely varying rays are
intersected against the DODB, then there will be a much
higher miss rate, and correspondingly lower
performance. Experiments verify the lower performance
of the standard decomposition, and show a very poor
load balance. It is therefore desirable to invent a new
image decomposition to solve the load balance and
locality problems simultaneously.

The solution used by the Hypercube Ray Tracer is
what is generally called the "block" decomposition.
The image plane is divided into a large number of small
rectangular blocks which are assigned dynamically to
processors. Each block encloses pixels that one node
will be responsible for ray tracing. Each block is
small enough that all rays passing through it can
be considered coherent. When a computing node
finishes ray tracing all of the pixels in its block, a
new block is assigned which is spatially close to the
previous one. In this method, the dynamic block
assignment solves the load balancing problem, and new
blocks are chosen close to old blocks to give heightened
locality of reference in the ODB.

Block  assignments  are  managed  by  a  separate 
program  running  on  the  iPSC/2's  System  Resource 
Manager  (SRM).  As  computing  nodes  complete  their 
blocks,  they  send  the  block's  frame  buffer  to  the  SRM 
where it is copied into the final image. After this is
done,  the  SRM  assigns  a  new  block  to  the  node  as  close 
to  the  old  one  as  possible.  This  process  continues  until 
all  blocks  have  been  ray-traced. 
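
The assignment policy can be sketched as below. The text does not say how the SRM program measures closeness, so the Manhattan distance on block grid indices, the tie-breaking rule, and the name next_block are all assumptions made for illustration.

```python
def next_block(unassigned, last_block):
    """Pick the unassigned block nearest to the node's previous block.

    Blocks are (column, row) indices on the viewplane grid. Distance is
    measured here as Manhattan distance, which is an assumption.
    """
    if not unassigned:
        return None
    if last_block is None:
        choice = min(unassigned)       # first request: any block will do
    else:
        lc, lr = last_block
        choice = min(unassigned,
                     key=lambda b: (abs(b[0] - lc) + abs(b[1] - lr), b))
    unassigned.remove(choice)
    return choice


# Hypothetical 4 x 4 grid of blocks shared by two nodes.
pool = {(c, r) for c in range(4) for r in range(4)}
a = next_block(pool, None)        # node A starts at (0, 0)
pool.discard((3, 3))              # node B was given the far corner
a2 = next_block(pool, a)          # A's next block is adjacent to (0, 0)
```

Because each node tends to walk outward from its starting block, successive blocks reuse many of the primitives already held in that node's cache.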

Results 

Figure 3 shows the number of ODB requests
received by all 32 nodes during the course of ray-tracing
a single image of 4208 primitives. In this case, the
cache size amounted to some 10% of the size of the
whole ODB. This figure shows the evenness of the
ODB distribution across the nodes. No one node, or set
of nodes, bears the brunt of ODB swapping traffic.

Figure 3: ODB Requests

Figure 4: Ray Tracing Time

Figure 5: Cache Hit Ratio

In Figures 4 and 5, we see the effects of reducing the
cache size for three different images. The "Bar Image"
contains 4208 primitives, the "Self Portrait" image
contains 1410 primitives, and the "Snowman Army
from Hell" image contains 1310 primitives.
Specifically, Figure 4 shows total ray-tracing time
versus cache size and Figure 5 shows cache hit rate
versus cache size. Note that total ray-tracing time is the
sum of the ray-tracing times for all 32 nodes of the
iPSC/2. In all cases, the cache hit rate remains very
high, and the ray-tracing time remains relatively
constant until a cache size of roughly 20% is reached.
Below a cache size of 20%, the ray-tracing time
increases rapidly with the decreasing cache hit rate. One
will note that the knee of the total ray-tracing time
curve is at a slightly lower cache size than the knee of
the hit rate curve. This effect is caused by the now
interruptible nature of the ray-tracing loop. When an
ODB miss occurs, the offending ray is placed on the ray
queue, and another ray is initiated. Even though the
cache performance decreases, the ray-tracing speed does
not! Not shown is the dependence of performance on
ray queue size. A slightly larger ray queue makes a
marked difference in performance for small cache sizes.

As  we  can  see  from  the  performance  figures,  the 
Hypercube  Ray  Tracer's  ODB  distribution  strategy  is 
very  effective  for  cache  sizes  down  to  only  15%  of  the 
tot^  ODB  size.  Furthermore,  the  cache  implements  a 
fully-associative,  self-balancing  structure  that  requires 
no  preprocessing  to  set  up.  The  fully-associative  nature 
of  the  cache  insures  that  no  heavily-used  primitive 
groups  are  replaced  just  because  they  happen  to  lie  in 
the  same  set  as  another  transitory  primitive.  We  must 
not  forget  that  the  only  reason  that  such  a  caching 
scheme  can  enjoy  any  success  at  all  lay  in  the  temporal 
coherence  with  which  the  ODB  is  queried  by  the  ray- 
object  intersection  process.  This  temporal  coherence  is 
due  directly  to  the  fact  that  rays  are  traced  in  close 
proximity  to  one  another  on  the  viewplane.  If  the  rays 
were  traced  randomly  throughout  the  viewplane,  then  no 
amount of caching would improve performance.

Conclusions  and  Future  Work 

We  can  also  see  that  this  caching  scheme  scales  up 
well  with  the  number  of  nodes,  and  particularly  well 
with  the  ODB  size.  Also,  since  there  are  no  artificial 
boundaries  imposed  on  the  ODB  cache,  it  takes 
maximum  advantage  of  node  memory,  duplicating  those 
portions  of  the  ODB  which  are  used  heavily. 

As  with  all  things  in  life,  there  is  a  price  for  such 
progress,  and  the  Piper's  name  is  Complexity.  Since  an 
ODB  miss  can  happen  at  any  point  during  the  ray¬ 
tracing  process,  the  state  of  each  ray  must  be  saved  until 
such  time  as  it  can  resume.  But  since  the  ray-tracing 
process  itself  is  so  time-consuming,  this  additional 
complexity  imposes  little  to  no  performance  penalty. 

A  number  of  elements  in  this  distributed  ODB 
scheme  warrant  further  investigation.  One  is  the  initial 
construction  of  the  hierarchy.  The  ODB  hierarchy  is  the 
one  fixed  data  structure  remaining  in  the  Hypercube  Ray 
Tracer.  Presently,  it  is  constructed  to  minimize  the 
total  surface  area  of  the  bounding  volumes  of  all  sub¬ 
hierarchies.  This  policy  makes  a  considerable 
performance  difference  with  respect  to  a  blindly 
constructed  hierarchy.  However,  it  leads  to  primitives 
being  placed  near  the  root  of  the  hierarchy.  It  is  unclear 
just  how  much  this  ragged  hierarchy  structure  impacts 
the  performance  of  the  distributed  ODB. 

Rays  could  be  swapped  across  nodes  as  well  as 
primitives. This would be more economical in cases
where  a  large  number  of  ODB  misses  would  occur. 
Instead  of  shipping  all  of  the  nonresident  primitives  to  a 
node,  the  node  would  have  the  option  of  shipping  the 
offending  ray  to  the  primitives. 

Perhaps most tantalizing would be a dynamic
restructuring of the entire ODB hierarchy to take more
advantage  of  ray  coherence.  Presently,  the  Kay 
intersection  algorithm  takes  no  advantage  of  ray 
coherence  since  the  structure  of  the  ODB  is  fixed.  If  a 
scheme  could  be  devised  to  restructure  the  ODB  toward 
the  goal  of  decreasing  the  number  of  tests  performed  by 
the  Kay  algorithm,  then  the  performance  of  the 
distributed  ODB  cache  would  be  improved  as  well  as 
absolute  ray-primitive  intersection  time. 

A  short-term  goal  is  to  investigate  the  effect  of  a 
least-frequently-used  (LFU)  cache  replacement  policy. 
A  longer-term  goal  might  be  to  include  secondary 
storage  in  the  ODB  distribution  scheme.  A  strategy  of 
this  sort  would  make  possible  the  ray-tracing  of  scenes 
of well into the millions of primitives.

Abstract

In this paper, three sorting algorithms, Bitonic
sort, Shell sort, and parallel Quicksort, are studied.
We analyze the performance of these algorithms and
compare the analysis with the empirical results obtained
from implementations on the Symult Series 2010,
a distributed-memory, message-passing MIMD machine.
Each sorting algorithm is a combination of
a parallel sort component and a sequential sort component.
These algorithms are designed for sorting M
random integers on an N-processor machine,
where M > N. We found that Bitonic sort
is the best parallel sorting algorithm for small problem
sizes, (M/N) < 64, and that parallel Quicksort
is the best for large problem sizes. The new parallel
Quicksort algorithm with a simple key selection
method achieves a decent speedup compared with
other versions of parallel Quicksort on similar parallel
machines. Although Shell sort has a worse theoretical
time complexity, it does achieve linear speedup
for large problem sizes by using a synchronization step
to detect early termination of the sorting steps.

Introduction 

As indicated by Knuth in his famous book on sorting
and searching [1]:

It  would  be  nice  if  only  one  or  two  of  the 
sorting  methods  would  dominate  all  of  the 
others,  regardless  of  the  application  or  the 
computer  being  used.  But  in  fact,  each 
method  has  its  own  peculiar  virtues . 

This remains true, if not more so, for sorting algorithms
on parallel machines, for two reasons. First,
the performance of a parallel sorting algorithm depends
on the degree of parallelism it can exploit on a
given architecture. For example, the level of parallelism
one can exploit often differs between
an SIMD and an MIMD machine. Second, the performance
also depends on the speed of certain critical
operations that the underlying parallel machine can
deliver. For example, interprocessor communication
can be a dominating operation on distributed-memory
machines because parallel sorting algorithms
often require the same order of magnitude of communication
steps as of computation steps.

In this paper, we focus on a single class of MIMD machine
on which the interconnection network topology
is not very important and the communication speed
is nearly balanced with the computation speed. By
choosing such a seemingly general-purpose, yet real,
machine, we are able to concentrate on finding which
sorting methods, or combinations of sorting methods,
are likely to be among the fastest on an MIMD machine.

We shall also confine ourselves to sorting a long list
of random input data using a smaller number of processing
elements (or nodes). That is, the sorting problem we
are interested in is to sort M random integers
on an N-node MIMD machine, where M > N.
Initially, the M unsorted elements are evenly distributed
over the computation nodes. Each node operates on
its own set of data independently, but can send data to or
receive data from another node. When all nodes terminate,
each node should hold a chunk of the sorted list,
and the chunks are stored in consecutive order across all
nodes such that the smallest chunk is stored in the
first node and so on. The chunk size may or may not be
M/N, depending on the algorithm used.
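
The termination condition described above can be written as a small check. The following Python sketch is illustrative only: the function name `is_globally_sorted` and the list-of-lists representation of the node memories are assumptions made here, not part of the original implementation.

```python
def is_globally_sorted(chunks):
    """Return True if every chunk is sorted and the chunks are in consecutive order.

    `chunks[i]` stands for the list held by node i after sorting (a
    hypothetical representation). Chunks may differ in length or be empty.
    """
    previous_max = None
    for chunk in chunks:
        # Each node's local list must itself be sorted.
        if any(a > b for a, b in zip(chunk, chunk[1:])):
            return False
        if chunk:
            # The smallest element on this node must not precede
            # the largest element seen on any earlier node.
            if previous_max is not None and chunk[0] < previous_max:
                return False
            previous_max = chunk[-1]
    return True
```

Because the check allows chunks of unequal length, it applies both to algorithms that keep exactly M/N elements per node and to those that do not.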

Because M > N, each of
the three sorting algorithms we have implemented
is a combination of a parallel sort (across nodes) and a
sequential sort (for the local list). We used a parallel
version of Quicksort [2], Batcher's bitonic sort [4],
and a mixture of Shell's sort and odd-even transposition
sort [1] as our parallel sorting strategies, and
the UNIX/BSD qsort routine as the sequential sorting
method. For simplicity, we shall call our algorithms
parallel Quicksort, Bitonic sort, and Shell sort,
respectively, in the following text.

Quicksort is not only a fast sequential sort method;
it is also naturally parallel because of its divide-and-conquer
structure. The only potential problem with the efficiency
of a parallel Quicksort is the selection of its
splitting keys. If such keys are randomly selected,
the input list can be divided into uneven sublists,
causing load imbalance. Carefully calculated splitting
keys will solve this problem, but the extra calculation
becomes a cost itself. So an efficient implementation
needs to strike a balance between the two extremes, which
is, fortunately, not very hard to achieve. Impressive
results for parallel Quicksort have been reported
for a vector machine, the CDC STAR [3], and for
hypercube-interconnected MIMD machines [5], among others.

Batcher's bitonic sort [4], on the other hand, has
been widely used across almost all kinds of parallel
computers (sorting networks, hypercube machines
[6], two-dimensional mesh machines [9], SIMD machines
[7]) for its simplicity and stability. It has a time
complexity of $O(\log^2 M)$ for sorting M elements, which
is reasonably efficient. The less well known Shell sort is
also selected because it appears to be a very efficient
algorithm when implemented on Caltech/JPL's Hypercube
machine [5].

In  the  rest  of  the  paper,  we  will  first  introduce 
the  underlying  machine  we  used  in  our  study,  and  its 
computation  and  performance  model;  followed  by  the 
three  sorting  algorithms  and  their  time  complexities. 
Then,  we  will  discuss  our  empirical  performance  re¬ 
sult,  and  address  a  few  related  issues  such  as  how 
general  our  result  can  be,  and  what  other  sorting 
methods  may  also  be  considered. 

Computation  and  Performance  Model 

The Symult Series 2010 system (S2010) is a
distributed-memory, message-passing MIMD computer
consisting of up to 1024 computational nodes
interconnected by a high-speed message-routing network
(GigaLink). Each computational node has a
Motorola MC68020 microprocessor as its CPU, operating
at 25 MHz and augmented by the Motorola
68881 floating-point co-processor. A SUN-3 workstation
is used as the front-end computer. The operating
system on the S2010 nodes is called the Reactive Kernel,
and the programming environment on the front-end
computer, serving as the interface between the users
and the S2010, is called the Cosmic Environment [8].

The S2010 is equipped with a fast communication
network, called the GigaLink network. A custom-designed
message routing chip, the Automatic Message Routing
Device (AMRD), provides fast fixed-route point-to-point
message routing using the "worm-hole" routing
algorithm. The interprocessor communication rate is
13 MB/sec regardless of the distance between source
and destination. This feature makes the S2010
resemble a fully connected machine.

To characterize the machine behavior, we carefully
measured timings for many computation and communication
instructions. Here are some of the timing
results useful for our sorting analysis, where one integer
is equivalent to four bytes:

•  copying one integer from one memory location to
another, without taking memory allocation overhead
into account, takes about 0.45 μs;

•  memory allocation overhead per memory copy
function (bcopy) is about 8 μs;

•  comparison-exchange for two integers takes
about 6.8 μs;

•  transmitting one integer in a typed message from
one node to another, without taking overhead
into account, takes about 0.31 μs under low to
normal traffic load;

•  average overhead for sending a typed message
from one node to another is about 251 μs.

In other words, if routing a message of size
K (integers) takes time $T_{route}(K)$, copying a message of the same
size locally takes time $T_{copy}(K)$, and performing
compare-exchange on K integers takes time
$T_{comp\text{-}ex}(K)$, then

$$
\begin{aligned}
T_{route}(K) &= T_{route\text{-}overhead} + T_{route\text{-}int} \cdot K = 251\,\mu s + 0.31\,\mu s \cdot K \\
T_{copy}(K) &= T_{copy\text{-}overhead} + T_{copy\text{-}int} \cdot K = 8\,\mu s + 0.45\,\mu s \cdot K \\
T_{comp\text{-}ex}(K) &= 6.8\,\mu s \cdot K \qquad (1)
\end{aligned}
$$

So we can conclude that, on the S2010, the overhead
for each message send/receive is very large, but the
transmission speed is comparable to the memory
access speed. Therefore, in our sorting algorithms,
interprocessor communication is a dominating term
when M/N is small, and its effect gradually diminishes
as M/N gets larger. Assuming the number
of compare-exchange steps is the same as the number
of messages, the communication overhead (i.e.,
$T_{route}(K)/T_{comp\text{-}ex}(K)$) is 41.4% for K = 100 and
drops to 8.2% for K = 1000.
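
The overhead figures quoted above follow directly from the measured constants. As a minimal sketch (the function and constant names are chosen here for illustration; only the numeric values come from the measurements listed earlier), the timing model of equation (1) can be evaluated as follows:

```python
# Measured constants from the text, in microseconds.
ROUTE_OVERHEAD_US = 251.0   # per-message send overhead
ROUTE_PER_INT_US = 0.31     # per-integer transmission time
COPY_OVERHEAD_US = 8.0      # per-call bcopy overhead
COPY_PER_INT_US = 0.45      # per-integer copy time
COMP_EX_PER_INT_US = 6.8    # per-integer compare-exchange time


def t_route(k):
    """Time to route a message of k integers."""
    return ROUTE_OVERHEAD_US + ROUTE_PER_INT_US * k


def t_copy(k):
    """Time to copy k integers locally."""
    return COPY_OVERHEAD_US + COPY_PER_INT_US * k


def t_comp_ex(k):
    """Time to compare-exchange k integers."""
    return COMP_EX_PER_INT_US * k


def comm_overhead(k):
    """Ratio of routing time to compare-exchange time for k integers."""
    return t_route(k) / t_comp_ex(k)


if __name__ == "__main__":
    for k in (100, 1000):
        print(f"K = {k}: communication overhead = {100 * comm_overhead(k):.2f}%")
```

For K = 100 the ratio is 282/680, about 41.5%, and for K = 1000 it is 561/6800, about 8.25%, which agrees with the quoted 41.4% and 8.2% up to rounding.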

In the following timing analysis, we shall assume,
for simplicity, that each element to be sorted is represented as an
integer. Thus K means the number of
elements in each step of computation.




The sequential qsort routine plays an important
role in all three algorithms. It has a time complexity
of $O(K \log K)$ for a single S2010 node to sort K elements.
By experiment, we found that the timing
equation for qsort with random input data can be
represented as follows:

$$T_{qsort}(K) = O(K \log K) = 8.5\,\mu s \cdot K \log K \qquad (2)$$


Bitonic  Sort 

In our implementation of this algorithm, the machine
is configured as an N-node hypercube. Initially each
node has M/N unsorted elements. Each node first
sorts its data internally using the qsort routine, and
then performs (log N (log N + 1))/2 steps of compare-exchange
operations along all dimensions of the cube.
After running the algorithm, every node has M/N
elements sorted both locally and globally.

Our algorithm for each individual node is shown
below, where dim, my_nid, and mask are the dimension
of the cube, the node id, and a mask flag for
selection of nodes, respectively:

1.  Sort the (M/N) elements locally in each node
using qsort. Sort in ascending order if my_nid
is even, in descending order if my_nid is odd.

For i := 0 to (dim − 1) step 1 do (2), (3), and (4):

2.  If the (i+1)-th bit of my binary address is 1,
mask := 1; otherwise, mask := 0.

3.  For j := i to 0 step −1 do:

(a) exchange my (M/N) elements with my j-th
bit neighbor; (b) compare/exchange the two
lists and copy the smaller half into the data area
if mask = the j-th bit of my binary address; copy
the larger half into the data area otherwise.

4.  Locate the maximum (or minimum) of the
bitonic sequence in each node and perform a
merge on sublists of length M/N. The sorted
sublist is in ascending order if mask = 0; otherwise,
it is in descending order.
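
The per-node steps above can be simulated on a single machine. The sketch below is a simplified variant and not the authors' code: it keeps every local list in ascending order and replaces steps (3b) and (4) with a full merge-split, in which each partner keeps the lower or upper half of the merged lists. This is a common block form of bitonic sort. The function name and the list-of-lists layout (one list per node) are assumptions made for illustration.

```python
def bitonic_block_sort(blocks):
    """Simulate block bitonic sort on a hypercube of len(blocks) nodes.

    Returns (sorted_blocks, steps), where steps counts the internode
    exchange steps, expected to be dim * (dim + 1) / 2.
    """
    n = len(blocks)
    dim = n.bit_length() - 1
    assert n == 1 << dim, "number of nodes must be a power of two"
    size = len(blocks[0])
    assert all(len(b) == size for b in blocks), "equal block sizes assumed"

    # Step 1: local sequential sort on every node.
    blocks = [sorted(b) for b in blocks]
    steps = 0
    for i in range(dim):
        for j in range(i, -1, -1):
            updated = []
            for nid in range(n):
                partner = nid ^ (1 << j)          # j-th bit neighbor
                merged = sorted(blocks[nid] + blocks[partner])
                mask = (nid >> (i + 1)) & 1       # step 2: bit (i+1) of the address
                j_bit = (nid >> j) & 1
                # Step 3b: keep the smaller half if mask equals the j-th bit.
                updated.append(merged[:size] if mask == j_bit else merged[size:])
            blocks = updated
            steps += 1
    return blocks, steps
```

With 8 nodes (dim = 3) the simulation performs 6 exchange steps, matching the (log N (log N + 1))/2 count stated in the text.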

Since each node has M/N elements, the time
complexity of step (1) is $O(\frac{M}{N}\log\frac{M}{N})$. Each
compare-exchange iteration in (3) takes time
$2T_{route}(\frac{M}{N}) + T_{comp\text{-}ex}(\frac{M}{N}) + T_{copy}(\frac{M}{N})$, and there
are $(\log N(\log N + 1))/2$ iterations in total. Each
merge operation in step (4) takes time $T_{comp\text{-}ex}(\frac{M}{N} + 2\log\frac{M}{N})$,
and this operation is performed $\log N$ times.

As a result, the total time for Bitonic sort, based on
our timing equations (1) and (2), can be expressed
in units of μs as follows:

$$
\begin{aligned}
T_{bitonic} &= O\!\left(\tfrac{M}{N}\log\tfrac{M}{N}\right) + \frac{\log N(\log N + 1)}{2}\left(510 + 7.87\tfrac{M}{N}\right) + 6.8\log N\left(\tfrac{M}{N} + 2\log\tfrac{M}{N}\right) \\
&= 8.5\,\tfrac{M}{N}\log\tfrac{M}{N} + \left(241.4 + 3.94\tfrac{M}{N}\right)\log^2 N + \left(13.6\log M + 10.74\tfrac{M}{N} + 255\right)\log N \qquad (3)
\end{aligned}
$$

The empirical timing curve for N = 16 is shown in
figure 1.
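
As a sanity check on equation (3), the stepwise form (local sort, exchange iterations, merges) and the expanded polynomial form can be evaluated side by side. The sketch below assumes base-2 logarithms and takes the local sort term as 8.5 (M/N) log(M/N) from equation (2); the function names are illustrative.

```python
import math


def t_bitonic_stepwise(m, n):
    """Equation (3), first form: local sort + exchange iterations + merges (microseconds)."""
    k = m / n
    lg_n = math.log2(n)
    local_sort = 8.5 * k * math.log2(k)
    exchanges = (lg_n * (lg_n + 1) / 2) * (510 + 7.87 * k)
    merges = 6.8 * lg_n * (k + 2 * math.log2(k))
    return local_sort + exchanges + merges


def t_bitonic_expanded(m, n):
    """Equation (3), second form: terms grouped by powers of log N (microseconds)."""
    k = m / n
    lg_n = math.log2(n)
    return (8.5 * k * math.log2(k)
            + (241.4 + 3.94 * k) * lg_n ** 2
            + (13.6 * math.log2(m) + 10.74 * k + 255) * lg_n)


if __name__ == "__main__":
    for log_m in (10, 13, 16):
        a = t_bitonic_stepwise(2 ** log_m, 16)
        b = t_bitonic_expanded(2 ** log_m, 16)
        print(f"log M = {log_m}: stepwise {a:.1f} us, expanded {b:.1f} us")
```

The two forms agree to within a small fraction of a percent; the residual difference comes from the rounded coefficients 3.94 and 10.74 in the expanded form.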

Shell  Sort 

As described earlier, Shell sort here means a method
that combines Shell's method and odd-even transposition
sort as the internode sort, with qsort as the sequential
sort. This algorithm, like the parallel Quicksort
algorithm, is executed on a ring topology:
the sorted data are stored in the same way as
in the Bitonic sort case, with the slight difference
that node addresses are arranged in a ring. Both
hypercube and ring topologies can be easily configured
on the S2010, without significant performance
difference.

This algorithm has three steps:

1.  Sort (M/N) elements locally in each node with
qsort.

2.  Do a compare-exchange operation between pairs
of adjacent nodes along the i-th cube dimension,
for i = log N − 1, ..., 0.

3.  Do compare-exchange operations between pairs
of adjacent nodes in the ring topology until no
exchange is made in any node.

The first parallel part (step 2) is a Shell's sort, except that only
one compare-exchange operation is performed in each
hypercube dimension, so the resulting list is only partially
sorted. This part takes log N compare-exchange operations
in total. The second part (step 3) is an odd-even
transposition sort, which terminates when no data is
exchanged in any node.
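
The three steps can be sketched as a single-process simulation. This is an illustration under stated assumptions rather than the authors' implementation: each compare-exchange is modeled as a merge-split in which the lower-numbered node keeps the smaller half, the ring is treated as a linear array of nodes 0 to N − 1 with no wraparound exchange, and termination is detected when one even phase and one odd phase in a row make no exchange. The function names are hypothetical.

```python
def merge_split(low, high):
    """Merge two sorted lists; return (smaller half, larger half), sizes preserved."""
    merged = sorted(low + high)
    return merged[:len(low)], merged[len(low):]


def shell_block_sort(blocks):
    """Simulate the Shell / odd-even transposition sort on len(blocks) nodes.

    Returns (sorted_blocks, phases), where phases counts the odd-even
    transposition phases, including the two final phases with no exchange.
    """
    n = len(blocks)
    dim = n.bit_length() - 1
    assert n == 1 << dim, "number of nodes must be a power of two"

    # Step 1: local sequential sort.
    blocks = [sorted(b) for b in blocks]

    # Step 2: one compare-exchange along each cube dimension, highest first.
    for i in range(dim - 1, -1, -1):
        for nid in range(n):
            partner = nid ^ (1 << i)
            if nid < partner:
                blocks[nid], blocks[partner] = merge_split(blocks[nid], blocks[partner])

    # Step 3: odd-even transposition between adjacent nodes until quiet.
    phases, quiet, parity = 0, 0, 0
    while quiet < 2:
        changed = False
        for left in range(parity, n - 1, 2):
            low, high = merge_split(blocks[left], blocks[left + 1])
            if low != blocks[left] or high != blocks[left + 1]:
                changed = True
            blocks[left], blocks[left + 1] = low, high
        quiet = 0 if changed else quiet + 1
        phases += 1
        parity ^= 1
    return blocks, phases
```

The early-termination test plays the role of the synchronization step in the text: on random input the loop usually stops long before the worst-case number of phases.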

The number of odd-even transposition steps is
equal to the maximal distance of a mispositioned element
from its sorted position. After the diminishing-increment
steps, the worst-case maximal distance is
$(N - 2\sqrt{N} + 1)$, where N is the number of nodes.
Consider an arbitrary element a, and let y and x
be the addresses of the nodes in which a is located after
step (1) and after sorting, respectively. Then $|y - x|$ is
the number of odd-even transposition steps required
to move a to its final position. Let a be an element
that has maximal $|y - x|$ with $x < y$; we now want
to find the minimal x for a given y. Let the binary
address of y be $(y_1, y_2, \ldots, y_d)$, where $d = \log N$, i.e.,
the dimension of the cube. After step (1), all the elements
in the nodes whose addresses can be derived
from y by changing one or more $y_j$'s from 1 to 0 should
be smaller than the elements in y. If there are k "1"
bits in the binary address of y, there will be at least
$(2^k - 1)$ nodes in which the elements are smaller than
those in y after step (1). Thus, the minimal element
that may be located in node y after step (1) is always
greater than the elements in the first $(2^k - 1)$
nodes after sorting. In other words, the minimal a in
node y will be stored in the $2^k$-th node ($x = 2^k - 1$)
after sorting is done. To maximize $|y - x|$, we shall
find the maximal y with a proper k. Obviously, the
maximal y which has k "1" bits is the one having all
1's in the most significant bits, and $y = N - 2^{d-k}$.
So $(y - x) = (N - 2^{d-k} - 2^k + 1)$, and the maximal
$(y - x)$ is equal to $(N - 2^{d/2+1} + 1)$, or $(N - 2\sqrt{N} + 1)$,
when $k = d/2$.
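
The maximization over k can be checked by brute force. The short sketch below (the function name is illustrative) evaluates $y - x = N - 2^{d-k} - 2^k + 1$ for every k and reports the largest value, which for even d should equal $N - 2\sqrt{N} + 1$ at k = d/2.

```python
def worst_case_distance(d):
    """Return (max distance, k) maximizing N - 2**(d-k) - 2**k + 1 over k = 0..d."""
    n = 1 << d
    candidates = []
    for k in range(d + 1):
        y = n - (1 << (d - k))   # largest address with k one-bits (all in the top bits)
        x = (1 << k) - 1         # smallest final position of an element from node y
        candidates.append((y - x, k))
    return max(candidates)


if __name__ == "__main__":
    for d in (2, 4, 6):
        distance, k = worst_case_distance(d)
        print(f"N = {1 << d}: worst-case distance {distance} at k = {k}")
```

For N = 16 (d = 4) this gives a distance of 9 = 16 − 2·4 + 1 at k = 2, and for N = 64 (d = 6) it gives 49 = 64 − 2·8 + 1 at k = 3.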

Therefore, in the worst case, there are $(N - 2\sqrt{N} + 1)$
compare-exchange operations in step (3).
Each compare-exchange operation in steps (2) and (3)
takes the same time as one iteration in step (3) of
Bitonic sort, i.e., $2T_{route}(M/N) + T_{comp\text{-}ex}(M/N) + T_{copy}(M/N)$.
Therefore, the worst-case time complexity
of the Shell sort is

$$T_{shell} = 8.5\,\frac{M}{N}\log\frac{M}{N} + \left(N - 2\sqrt{N} + \log N\right)\left(510 + 7.87\frac{M}{N}\right) \qquad (4)$$

The first term is the time for the sequential sort of the
local M/N-element sublist. The real timing for the
case N = 16 is shown in figure 1.

Parallel  Quicksort 

Quicksort is a divide-and-conquer sorting algorithm
which is potentially applicable to parallel computation.
In order to get the best performance from
quicksort, the splitting keys should be selected
with great care so that the list to be sorted can be
decomposed into two sublists of equal length. This
is even more important in the parallel quicksort,
because an improper selection of the splitting keys
results in load imbalance, and the computation time
is determined by the slowest node.

As with Bitonic sort and Shell sort, the unsorted
list is initially stored evenly across the cube, i.e., each
node has M/N elements in arbitrary order. After
sorting, the sorted list will be stored in the cube in
consecutive order, but each node may have a different
number of elements. The parallel quicksort works as
follows. First, (N − 1) splitting keys are selected using
a presorting algorithm. Second, the list in each
node is split into two parts according to a proper
splitting key and exchanged with its neighbor along
a certain dimension. This splitting process repeats
log N times. Finally, each node sorts its local list
with a fast sequential sorting algorithm.

The algorithm and its time performance are described
as follows:

1.  Choose k samples randomly from the (M/N)-element
sublist of each node. Find the largest and the
smallest elements in the sample; let them be
(max, min).

2.  Perform the maximum and minimum operations
on each node's (max, min) pair globally across
the nodes to find the maximum and the minimum
elements, say (gmax, gmin), in the whole
sample. This global operation is done in a binary
tree manner, and it needs log N + 1 communication
steps, i.e., $(\log N + 1) \cdot (T_{route}(2) + T_{comp\text{-}ex}(2))$.

3.  Equally divide (gmax − gmin) into N − 1 intervals
and use the N boundary elements as the
splitting keys. Since each node only needs log N
splitting keys, each node can use a binary search
to find all its keys in log N steps.

4.  for i := dim − 1 to 0 step −1 do
      compare my sublist with the i-th splitting key
      and divide it into two sublists;
      if (my_nid < neighbor[i])
        exchange the larger sublist with the smaller
        sublist of my i-th bit neighbor
      else
        exchange the smaller sublist with the
        larger sublist of my i-th bit neighbor
      endif.

5.  Sequentially sort the sublist locally.
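
The five steps can be sketched as a single-process simulation. This is a minimal illustration, not the authors' code, and it makes one simplifying assumption that differs slightly from step (3): the sampled range [gmin, gmax] is divided into N equal-width buckets, and the N − 1 interior boundaries are used as splitting keys. The function name, the `samples_per_node` parameter, and the fixed random seed are hypothetical.

```python
import random


def parallel_quicksort(blocks, samples_per_node=8, seed=0):
    """Simulate the sampled-key parallel quicksort on len(blocks) nodes.

    Returns one sorted list per node; list lengths may differ between nodes.
    At least one input block must be non-empty.
    """
    n = len(blocks)
    dim = n.bit_length() - 1
    assert n == 1 << dim, "number of nodes must be a power of two"
    rng = random.Random(seed)

    # Steps 1-2: sample each node, then reduce to a global (gmin, gmax).
    sample = []
    for block in blocks:
        sample.extend(rng.sample(block, min(samples_per_node, len(block))))
    gmin, gmax = min(sample), max(sample)

    # Step 3 (assumed variant): equal-width boundaries; keys[1..n-1] are used.
    width = (gmax - gmin) / n
    keys = [gmin + j * width for j in range(n + 1)]

    blocks = [list(b) for b in blocks]
    # Step 4: split along each cube dimension, highest dimension first.
    for i in range(dim - 1, -1, -1):
        for nid in range(n):
            if nid & (1 << i):
                continue                      # handle each pair once, from its lower node
            partner = nid | (1 << i)
            base = (nid >> (i + 1)) << (i + 1)
            key = keys[base + (1 << i)]       # boundary between the two half-groups
            both = blocks[nid] + blocks[partner]
            blocks[nid] = [x for x in both if x < key]
            blocks[partner] = [x for x in both if x >= key]

    # Step 5: local sequential sort.
    return [sorted(b) for b in blocks]
```

Elements that fall outside the sampled range still end up on the first or last node, so the result is globally ordered regardless of sample quality; only the load balance depends on how well the keys match the data.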

The main part of the parallel Quicksort, i.e., step
(4), takes

$$\log N \cdot \left(2T_{route}\!\left(\tfrac{M}{N}\right) + T_{comp\text{-}ex}\!\left(\tfrac{M}{N}\right)\right)$$

for the best case, assuming each node always holds
M/N elements after each compare-splitting step.
With a good set of splitting keys, the parallel
Quicksort has a best/average time performance of

$$T_{quick} = 8.5\,\frac{M}{N}\log\frac{M}{N} + 7.41\,\frac{M}{N}\log N + 767.2\log N \qquad (5)$$

The sampling time in step (1) is proportional to the
sample size in each node, which is negligible. The first
term of the equation is the sequential sorting time in
step (5). The real timing for the case N = 16
is shown in figure 1.

Performance Comparison and Analysis

We have measured execution time for each of the
above three sorting algorithms for N = 8, 16, 32 and
64, with M ranging from $2^{8}$ to $2^{18}$. Figure 1 shows the
timing curves, with the execution time plotted against log M
for N = 16.


Figure 2. Speedup curves on a 16-node S2010 (x-axis: log2 M, problem size; y-axis: speedup)


Figures 2 and 3 are speedup curves calculated from
the real execution times of the three parallel algorithms for
N = 16 and 64, respectively.


Figure  1.  Timing  on  a  16-node  S2010 

Input  lists  are  generated  by  using  UNIX  random 
routine.  The  execution  time  is  determined  by  the 
sorting  time  of  the  slowest  node.  The  down-loading 
and  up-loading  (input  and  output)  time  is  not  con¬ 
sidered  in  our  experiments. 

From these speedup curves, it is observed that
the increasing communication overhead degrades the
sorting speed for small lists (lists with 1K elements
or fewer) as the machine size increases. In the case
of Bitonic sort, which has the lowest communication
overhead and is the fastest algorithm for small lists,
the ratio of the communication time to the computation
time is more than two for small lists.

Figure 3. Speedup curves on a 64-node S2010 (x-axis: log2 M, problem size)

In the case of Shell sort, timing equation (4) does
not include the time to broadcast the boolean flag
which indicates whether any exchange has been made in each
compare-exchange step in (3) of the Shell sort algorithm.
This time may be negligible when M ≫ N,
but becomes the major overhead when N gets large,
or when M ≈ N. Broadcasting is done in a binary tree
manner, which requires (log N + 1) steps of message
transmission after each compare-exchange step. In
the worst case, the broadcast overhead is as high as
$253.3\,(N - 2\sqrt{N} + 1)(\log N + 1)$. Although the parallel
sorting part of $T_{shell}$ has an $O(M)$ time complexity in
the worst case, the broadcasting step may save a lot of
compare-exchange steps for random input data. The
empirical results show that the parallel Shell sort can
achieve linear speedup for large problem sizes with random
data. See figures 2 and 3.

As to parallel Quicksort, the empirical results show
that a presorting procedure as simple as the above-mentioned
splitting key selection mechanism can result
in very good load balancing, and thus a superlinear
speedup is observed. A more complicated presorting
algorithm based on the bitonic sort has also
been attempted, but it results in a higher overhead,
i.e., $O(k \log^2 N)$ for k samples per node, and
worse load balancing than the above algorithm. Consequently,
we can conclude that, for random data,
the simple equally-divided key selection method can
achieve the best performance of the parallel Quicksort.
See figures 2 and 3.

Unlike  the  other  two  sorting  algorithms,  the  sorted 
list  obtained  from  this  algorithm  is  not  evenly  dis¬ 
tributed  in  each  node.  This  is  not  a  problem  if  the 
sorted  list  is  up-loaded  to  the  host  machine  without 
further  computation.  On  the  other  hand,  if  sorting 
is  just  a  part  of  the  computation  and  the  sorted  list 
needs  to  stay  in  the  cube  for  later  use,  the  unbalanced 
data  distribution  may  not  be  desirable.  In  this  case, 
we  may  need  to  rearrange  the  elements  so  that  each 
node  keeps  the  same  number  of  elements.  The  cost 
for  the  redistribution  needs  further  investigation. 

Conclusions 

We  have  implemented  three  sorting  algorithms  on 
S2010,  a  distributed-memory  message-passing  MIMD 
machine.  These  algorithms  are  chosen  because  they 
can  be  parallelized  easily  on  a  mesh  or  hypercube  ar¬ 
chitecture.  Each  sorting  algorithm  is  a  combination 
of  parallel  and  sequential  sorting  methods  and  has  a 
different  time  complexity. 

In the parallel sorting component, Bitonic sort
takes a fixed number of steps to sort regardless of the input
data pattern, with time complexity $O(\frac{M}{N}\log^2 N)$.
Parallel Quicksort has a performance that depends on
how well the splitting keys are selected, and it is shown
that with a little presorting overhead this algorithm
can achieve very good load balancing, and thus a best
time performance of $O(\frac{M}{N}\log N)$. The performance of
the Shell sort is constrained by its second part, the
odd-even transposition sort, which is a slow sequential
sorting algorithm. Nevertheless, by taking
advantage of the asynchronous nature of the S2010, the
parallel version of the Shell/odd-even transposition
sort can be as good as the parallel Quicksort when


The sequential sorting part, which is performed
on the M/N elements locally on each processor as the
first step in Bitonic sort and Shell sort, or on a varying
number of elements locally as the last step in parallel
Quicksort, has a time complexity of $O(\frac{M}{N}\log\frac{M}{N})$.

The overall performance of the three algorithms is
a combination of this sequential performance and the
parallel sort performance. We found from our empirical
results that, for relatively small problems,
M/N < 64, say, Bitonic sort is the best because it
has the lowest synchronization overhead. The parallel
Quicksort is the best for large
problem sizes, which agrees with our analysis. It is
interesting to learn that Shell sort outperforms Bitonic
sort for large problem sizes, which is mainly
due to the fact that the Shell sort often terminates
the sorting steps early. Both the parallel Quicksort
and Shell sort achieve linear speedup compared to
sequential qsort for large problem sizes on 8- to 64-processor
machines. Shell sort is the slowest of the three
for small problem sizes because of its high synchronization
overhead.

Abstract

A parallel algorithm for sorting n elements
evenly distributed over $2^d = p$ nodes of a d-dimensional
hypercube is given. The algorithm ensures
that the nodes always receive an equal number
of elements (n/p) at the end, regardless of the
skew in the data distribution.


I.  Introduction 

This paper addresses the problem of sorting n
elements evenly distributed over $2^d = p$ nodes of
a d-dimensional hypercube, where n ≫ p. The n
elements are defined as sorted whenever a global
order is obtained such that for $p - 1 \ge i > j \ge 0$
any element in node i is greater than any element
in node j, and within each node the n/p elements
are sorted among themselves. The main contributions
presented in this paper are:

1.  An enumeration sorting algorithm which
ensures that the nodes always receive an equal
number of elements (n/p) at the end,
regardless of the skew in the data distribution.
The running time of the algorithm is
$O((n \log n)/p + p \log^2 n)$ for a uniform data
distribution.

2.  A parallel selection algorithm which determines
the p − 1 partitioning keys used in
sorting in $O(p \log^2 n)$ time. The best known
previous result on selection [1] would yield
a time complexity of $O(p \log^2 p \log(n/p))$
for the same problem.

3.  A communication algorithm used in sorting
which eliminates the store-and-forward
overhead  by  making  use  of  the  distance- 
el  communication  capability  of  the  iPSC/2 
hypercube system. This algorithm is substantially
faster than the store-and-forward
scheme.

Implementation  results  show  that  the  sorting 
algorithm  based  on  Items  2  and  3  above  performs 
better  than  the  hyperquicksort  algorithm  for  large 
«[2]. 

Parallel sorting algorithms for distributed memory hypercube
multiprocessors were previously given in [3, 2, 4, 5, 1, 6]. The
enumeration sorting algorithms given in [2, 5, 4] do not address the
problem of distributing data equally across the nodes: nodes do not
necessarily finish the sort with n/p elements each, and depending on
the skew and initial ordering of the data, some nodes may end up with
more than n/p elements. This may lead to two problems: 1) load
imbalance during the sort, since some nodes have to process more than
n/p elements, and 2) insufficient memory to complete the sort in some
nodes. For example, hyperquicksort, which is commonly known as the
fastest practical sorting algorithm, performs poorly in some cases
where nearly all n elements, instead of n/p, end up in one node,
limiting both the speedup and the useful range for n [2].

In the sorting algorithm described here, elements are evenly
distributed across the p nodes such that from start to finish each
node processes exactly n/p elements. This is accomplished by using
the balanced partition keys for redistributing data among the nodes,
in contrast to other algorithms which select the partition keys
either randomly or by sampling the elements. The balanced partition
keys can be determined in the following manner: Let L[1 ... n] be the
final sorted list, the result of the sort. The p - 1 elements L[kn/p]
(k = 1, ..., p - 1) correspond to the balanced partition keys, since
any two keys L[kn/p] and L[(k + 1)n/p] have exactly n/p elements
between them in the final sorted list L[1 ... n]. If the p - 1
partitioning keys are available, every node can send all elements
greater than or equal to L[kn/p] and smaller than L[(k + 1)n/p] to
node k, so that the elements are globally ordered and the total
number of elements received by node k is exactly n/p. Therefore, the
main steps of the sorting algorithm can be given as follows:

1. Quicksort: Each node independently quicksorts the n/p elements
initially residing in its memory to form a sorted list
A[0 ... n/p - 1].

2. Select Partitioning Keys: Nodes run a parallel selection
algorithm to determine the p - 1 partitioning keys L[kn/p]
(k = 1, ..., p - 1). This algorithm, described in Section III, runs
in $O(p \log^2 n)$ average time.

3. Global Exchange: Each node finds the insertion point of the p - 1
partitioning keys in its list A[0 ... n/p - 1]. This will partition
the list into p segments, in general. The segment between the
insertion points of L[kn/p] and L[(k + 1)n/p] is sent to node k
(k = 1, ..., p - 2). Similarly, the segment below the insertion point
of L[n/p] is sent to node 0, and the segment above the insertion
point of L[(p - 1)n/p] is sent to node p - 1.

4. Binary Tree Merge: Each node has now received p - 1 sorted
segments from other nodes and one from itself. Each node forms a
single sorted list out of these p segments in $O((n \log p)/p)$ time
using binary tree merge.

5. End of Algorithm.

Note that the time complexities of Steps 1, 2, and 4 add up to
$O((n \log n)/p + p \log^2 n)$. The time complexity of Step 3 depends
on the global data distribution. For a uniform distribution, it is
$O(n/p)$. In the next section we describe Step 3 in more detail. The
parallel selection algorithm and implementation results are presented
in Sections III and IV, respectively.
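
The four steps above can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions: the function name `balanced_partition_sort` is hypothetical, the nodes are modelled as Python lists, all keys are assumed distinct, the partition keys are read off a globally sorted copy (standing in for the parallel selection of Section III), and `heapq.merge` performs a p-way merge in place of the binary tree merge.

```python
import bisect
import heapq


def balanced_partition_sort(nodes):
    """Sequentially simulate the sort on p lists of n/p distinct keys each."""
    p = len(nodes)
    m = len(nodes[0])  # n/p elements per node

    # Step 1: each node sorts its own elements.
    local = [sorted(a) for a in nodes]

    # Step 2 (stand-in): take the balanced partition keys from a sorted copy.
    # Here L is 0-indexed, so the k-th key has exactly k*m smaller elements.
    L = sorted(x for a in local for x in a)
    keys = [L[k * m] for k in range(1, p)]

    # Step 3: cut every local list at the insertion points of the keys
    # and route segment k to node k.
    inbox = [[] for _ in range(p)]
    for a in local:
        cuts = [0] + [bisect.bisect_left(a, key) for key in keys] + [m]
        for k in range(p):
            inbox[k].append(a[cuts[k]:cuts[k + 1]])

    # Step 4: each node merges the p sorted segments it received.
    return [list(heapq.merge(*segments)) for segments in inbox]
```

With distinct keys, every returned list has length n/p and the concatenation of the lists is globally sorted, which is the balance property the text describes.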


II.  Global  Exchange 

Let $A_i^j$ (j = 0, 1, ..., p - 1) denote the p sorted segments in
node i induced by the balanced partition keys. In the global exchange
step of the sort, segments are exchanged among the nodes such that
each node i sends its segment $A_i^j$ to node j. These exchanges must
be ordered such that segments do not collide and block each other on
the hypercube links. We perform this task as follows. In the iPSC/2
hypercube, each node is equipped with a direct connect module (DCM)
which allows non-neighboring nodes to communicate directly [7],
instead of using the store-and-forward scheme [8]. A DCM is basically
a (d + 1)-input, (d + 1)-output crossbar switch. The d input-output
pairs of the DCM are connected to the d neighbors of the node through
hypercube links. The remaining input-output pair is connected to the
internal bus of the node, hence to its memory. A DCM can be set up so
that a message coming from one link can be immediately directed to
another link, thereby eliminating the store-and-forward overhead. Our
measurements on the iPSC/2 indicate that this scheme is as fast as
near-neighbor communication if all the links in the communication
path are available. A fixed routing scheme called the e-cube
algorithm is used for routing the messages in the iPSC/2 [7], where
the bit-by-bit logical exclusive-or of the source and destination
node numbers gives the routing tag. Nonzero bit positions in the
routing tag, when read from right to left, give the hypercube
coordinate directions a message goes through. For example, if the
routing tag is $r = (r_4 r_3 r_2 r_1 r_0) = (01011)$, the message
first goes in coordinate direction 0, then direction 1, then
direction 3, and finally arrives at its destination.
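
The routing rule can be made concrete with a few lines of code. This is a minimal sketch of e-cube routing as described in the text, not the iPSC/2 implementation; the function name `ecube_route` is hypothetical.

```python
def ecube_route(src, dst):
    """Return the list of nodes visited from src to dst under e-cube routing.

    The routing tag is src XOR dst; its nonzero bits are resolved from the
    least significant bit (coordinate direction 0) upward.
    """
    tag = src ^ dst
    node = src
    path = [src]
    dim = 0
    while tag >> dim:
        if (tag >> dim) & 1:
            node ^= 1 << dim  # cross the link in coordinate direction dim
            path.append(node)
        dim += 1
    return path
```

For the routing tag 01011 used in the example, a message from node 0 crosses directions 0, 1, and 3 in that order, visiting nodes 0, 1, 3, and 11.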

By making use of the DCMs and the e-cube scheme, the following global
exchange algorithm coordinates the exchange of segments so that they
are delivered to their destinations directly and so that the
communication paths used by the segments are always disjoint. Thus,
segments never block each other on the network links. Each node
distributively executes the following, where $\oplus$ denotes an
exclusive-or operation:

Let z be this node's id, and let $A_z^k$ (k = 0, ..., p - 1) be the
segments in node z

for k = 1, ..., p - 1
    send segment $A_z^{z \oplus k}$ to node $z \oplus k$
    receive segment $A_{z \oplus k}^{z}$ from node $z \oplus k$
    wait for receive and send to complete
    sync
endfor

Processors wait at the sync instruction until it is executed by all
p of them, which ensures that no processor gets ahead and occupies
links used by other processors. It may be verified from Fig. 1 that
the segments follow disjoint paths. An $O(n)$ bottleneck can be
present in the global exchange algorithm for some data distributions.
For example, this will happen when each node has to send its entire
list of size n/p to another node. The bottleneck may be eliminated by
removing the sync statement or by using different routing algorithms,
but this is a subject of further analysis and will not be discussed
here.
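
The disjoint-path claim can also be checked mechanically. The sketch below is an illustration only: it assumes each hypercube link is modelled as a directed channel identified by (sending node, coordinate direction), and the helper names `ecube_links`, `exchange_pairs`, and `step_is_link_disjoint` are hypothetical.

```python
def ecube_links(src, dst, d):
    """Directed links (node, direction) used by an e-cube route on a d-cube."""
    links = []
    node = src
    for dim in range(d):
        if ((src ^ dst) >> dim) & 1:
            links.append((node, dim))
            node ^= 1 << dim
    return links


def exchange_pairs(d, k):
    """(sender, receiver) pairs of step k: node z sends to node z XOR k."""
    return [(z, z ^ k) for z in range(1 << d)]


def step_is_link_disjoint(d, k):
    """True if no directed link is used by two messages during step k."""
    used = set()
    for src, dst in exchange_pairs(d, k):
        for link in ecube_links(src, dst, d):
            if link in used:
                return False
            used.add(link)
    return True
```

All messages of a step share the routing tag k, so at every hop they leave distinct nodes in the same direction; under the stated link model this is why the check succeeds for every step of a 3-cube.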

An interesting feature of our sorting algorithm is the ability to
overlap communication and computation in the global exchange and
binary tree merge steps using asynchronous communication primitives:
as soon as node z receives its first segment $A_{z \oplus 1}^{z}$, it
begins merging the pair of segments $A_z^z$ and $A_{z \oplus 1}^{z}$,
while two more segments arrive at the node in parallel with the
merge. Merging of segment pairs continues in this pipelined fashion
until all of the segments are exchanged.

III.  Selecting  the  Partition  Keys 

The partitioning problem here is selecting the n/p-th, 2n/p-th, ...,
(p - 1)n/p-th largest keys out of p sorted lists of size n/p each. A
partitioning key, namely L[kn/p] (k = 1, ..., p - 1), can be
determined in a fashion similar to the ordinary binary search. An
informal description of the algorithm will be given first. Assume a
key X is proposed as the partition key. Each node determines the
number of elements smaller than X (referred to as the local rank) in
its sorted list A[0 ... n/p - 1]. The local rank can be determined in
log(n/p) comparisons using binary search. Summation of the p local
ranks of X gives its global rank, hence its position in the final
sorted list L[1 ... n]. If the global rank of X is greater (smaller)
than kn/p, a new candidate smaller (greater) than X is proposed as
the partition key, and the procedure is iterated until L[kn/p] is
found. The number of iterations will be $\log_2 n$ on the average, if
candidate partition keys are chosen properly. This idea is the basis
of the partitioning algorithm presented next. We give only a sketch
of the algorithm. Exact details can be found in [6]:

1. Initialize: Let A[0 ... n/p - 1] be the sorted list of n/p
elements in node i (i = 0, ..., p - 1). Let the local variables
min[k] and max[k] (k = 1, ..., p - 1) be the pointers for the sorted
list A[0 ... n/p - 1]. During the iterations, the local search space
for the k-th partition key L[kn/p] will always be between min[k] and
max[k] such that A[min[k]] < L[kn/p] < A[max[k]]. Set min[k] = -1,
max[k] = n/p for k = 1, ..., p - 1. Initially, each node will propose
$A[kn/p^2]$ as a candidate for the k-th partition key L[kn/p] for
k = 1, ..., p - 1.

2. Transpose: Each node i (i = 0, ..., p - 1) is now holding a
candidate for the k-th partition key. All p candidates associated
with the k-th partition key are moved to node k for
k = 1, ..., p - 1. Node 0 gets only NIL values. This step can be
completed in $O(p \log p)$ time.

3. Select Median of the Candidates: Node k is now holding p
candidates associated with the k-th partition key. Node k quicksorts
these keys and then determines their median in $O(p \log p)$ time.
(The median key will be referred to as the k-th candidate, meaning
that it is the candidate for the k-th partition key.) Candidates
other than the median are discarded. Node 0 is idle at this step.

4. Broadcast: Each node k (k = 1, ..., p - 1) broadcasts the median
key to the rest of the nodes 0, 1, ..., p - 1. This step takes
$O(p)$ time.

5. Local Rank Computation: Every node now has a copy of the p - 1
candidates. The local rank R[k] of the k-th candidate in each node is
determined by a binary search in A[0 ... n/p - 1]. This step takes
$O(p \log(n/p))$ time in total.

6. Global Rank Computation: The p local ranks of the k-th candidate
are summed in $\log_2 p$ communication and addition steps to give its
global rank G[k] for k = 1, ..., p - 1, resulting in $O(p \log p)$
overall time for this step. If G[k] = kn/p, then the k-th candidate
is the balanced partition key L[kn/p].

7. Reduce the Search Space: If G[k] > kn/p, it is known that the
k-th candidate is greater than L[kn/p]. Therefore, each node
decreases its max[k] pointer to the candidate's insertion point in
its list A[0 ... n/p - 1]. Likewise, if G[k] < kn/p, then each node
increases its min[k] pointer to the candidate's insertion point in
its list A[0 ... n/p - 1].

8. Propose New Candidates: For k = 1, ..., p - 1, each node proposes
A[(max[k] + min[k])/2] as a candidate and the next iteration begins
from Step 2 until all p - 1 balanced partition keys are found.

9. End of Algorithm.

Each iteration of the algorithm takes $O(p \log n)$ time, and the
number of iterations is $\log_2 n$ on the average, giving an average
time complexity of $O(p \log^2 n)$. Note that the previous result in
[1] yields $O(p \log^2 p \, \log(n/p))$ time complexity for the same
problem.
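
The candidate-and-rank iteration of Steps 1 to 8 can be sketched sequentially as follows. This is an illustration under assumptions, not the authors' implementation: the function name `select_partition_keys` is hypothetical, keys are assumed distinct, the p - 1 keys are found one after another rather than simultaneously, the transpose, broadcast, and global summation are replaced by ordinary Python loops, and the rank is counted 0-indexed (the k-th key is the element with exactly kn/p smaller elements).

```python
import bisect


def select_partition_keys(lists):
    """Find the p - 1 balanced partition keys of p sorted lists of equal size."""
    p = len(lists)
    m = len(lists[0])  # n/p
    keys = []
    for k in range(1, p):
        target = k * m                 # desired global rank kn/p
        lo = [-1] * p                  # min[k] pointer of every node
        hi = [m] * p                   # max[k] pointer of every node
        # Step 1: every node proposes A[kn/p^2].
        cands = [a[target // p] for a in lists]
        while True:
            # Step 3: median of the candidates.
            x = sorted(cands)[len(cands) // 2]
            # Steps 5 and 6: local ranks by binary search, summed globally.
            rank = sum(bisect.bisect_left(a, x) for a in lists)
            if rank == target:
                keys.append(x)
                break
            # Step 7: shrink each node's search space around the candidate.
            for i, a in enumerate(lists):
                if rank > target:
                    hi[i] = min(hi[i], bisect.bisect_left(a, x))
                else:
                    lo[i] = max(lo[i], bisect.bisect_right(a, x) - 1)
            # Step 8: nodes with a nonempty search space propose its midpoint.
            cands = [a[(lo[i] + hi[i]) // 2]
                     for i, a in enumerate(lists) if hi[i] - lo[i] > 1]
    return keys
```

Each iteration removes at least the rejected candidate from its node's search space, while the sought key always stays strictly between the two pointers of the node that holds it, so the loop terminates for distinct keys.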

The transpose operation in Step 2 is a general hypercube algorithm
for distributing p values in every node to the rest of the nodes in
$O(p \log p)$ time. Each of the p values in any given node is
addressed to a different node. Let val in the tuple <val, dst, src>
denote the value to be sent from node src to node dst. In any given
node z, there are p tuples <val_j, j, z> (j = 0, 1, ..., p - 1)
initially. Upon completion of the algorithm, node z receives the p
tuples <val_z, z, j> (j = 0, 1, ..., p - 1) addressed to it from
other nodes. For example, assume that the following values are
present on a 4-node system initially:


 j    Node 0   Node 1   Node 2   Node 3
 0    NIL      NIL      NIL      NIL
 1    12       46       23       19
 2    28       3        35       37
 3    8        57       18       66

After transpose, contents of the nodes change as follows:

Node 0   Node 1   Node 2   Node 3
NIL      12       28       8
NIL      46       3        57
NIL      23       35       18
NIL      19       37       66

The transpose algorithm can be described as follows: Let
$z = (z_{d-1} \cdots z_0)$ be the binary representation of node z's
id, and let T[0 ... p - 1] denote the list of p tuples <val_j, j, z>
residing in node z.

for m = 0, 1, ..., d - 1

1. split T into two lists B and B' such that B contains the tuples
whose j field agrees with $(z_{d-1} \cdots z_0)$ in the m-th bit
position, and B' contains the tuples which do not,

2. send B' to node $(z_{d-1} \cdots \bar{z}_m \cdots z_0)$ on
coordinate m,

3. receive C from node $(z_{d-1} \cdots \bar{z}_m \cdots z_0)$ on
coordinate m,

4. $T \leftarrow B \cup C$
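
The loop above can be simulated on one machine to check it against the 4-node example. This is a sketch under assumptions: the function name `hypercube_transpose` is hypothetical, every node is a Python list, NIL is represented by `None`, and the pairwise send and receive of one step are modelled by reading the neighbor's outgoing list.

```python
def hypercube_transpose(values):
    """values[z][j] is the value node z holds for destination node j.

    Returns result where result[z][j] is the value node z received from
    source node j.
    """
    p = len(values)
    d = p.bit_length() - 1  # p = 2**d nodes
    # Tuples are (val, dst, src), as in the text.
    T = [[(values[z][j], j, z) for j in range(p)] for z in range(p)]
    for m in range(d):
        bit = 1 << m
        # B: destination agrees with this node's id in bit m; B': it does not.
        keep = [[t for t in T[z] if not (t[1] ^ z) & bit] for z in range(p)]
        send = [[t for t in T[z] if (t[1] ^ z) & bit] for z in range(p)]
        # Each node receives C = B' of its neighbor across coordinate m.
        T = [keep[z] + send[z ^ bit] for z in range(p)]
    # Order the received tuples by source node for a readable result.
    return [[val for val, dst, src in sorted(T[z], key=lambda t: t[2])]
            for z in range(p)]
```

After step m every tuple resides in a node whose id agrees with the tuple's destination in bits 0 through m, so after d steps each tuple has reached its destination.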

IV.  Experimental  Results  and 
Conclusions 

The parallel sorting algorithm was implemented for sorting 32-bit
integers on an 8-node, 386-processor-based iPSC/2 hypercube
multiprocessor. The hyperquicksort algorithm was also implemented for
comparison [2]. The global exchange and the binary tree merge steps
were implemented to allow communication and computation overlap as
described earlier. The elements to be sorted are randomly generated
in the nodes. Locally in each node, unsorted elements are uniformly
distributed. To observe the effect of interprocessor communication
during the global exchange step, three different global data
distributions were used. The BALANCED distribution is adjusted such
that the p - 1 partition keys split each list of size n/p into p
equal-sized segments. Hyperquicksort reaches its best performance at
this distribution. Note also that the quicksort and merge steps of
our algorithm and of hyperquicksort are supposed to take equal time
for this distribution. Table 1, columns 3 and 4, shows the sort times
for hyperquicksort and our algorithm, respectively. Results show that
as n gets large, our algorithm sorts faster (faster cases are
indicated by *). Thus, just by virtue of the global exchange
algorithm described in Section II, better results were obtained for
large n. The column Q indicates the quicksort time, P the partition
time, and M the global exchange and merge time of our algorithm. The
number of iterations made by the partitioning algorithm is indicated
in the last column. Table 1 shows that the number of iterations is
approximately $\log_2 n$ for the BALANCED distribution case.

The WORST distribution is adjusted to induce an $O(n)$ bottleneck in
the global exchange step of our algorithm. All of the n/p elements
initially in node i (i = 0, ..., p - 1) have to move to node
i + 1 (mod p) after sorting. Thus, in each node i, segment $A_i^j$
has size n/p if j = i + 1 (mod p), and has size 0 if
j ≠ i + 1 (mod p). Since the exchanges of segments are serialized
by the sync instruction, it will take $O(n)$ time to complete the
global exchange step for this distribution. Table 2 shows that for
large n hyperquicksort is slower. This was due to load imbalance in
the merge step of hyperquicksort: some nodes had to merge more than
n/p elements, while every node finished the sort with exactly n/p
elements in our algorithm.

In the BEST distribution, initially all elements in node i are
greater than all elements in node j if i > j. Since the data is
already globally ordered, the exchange of segments during the global
exchange step is eliminated. Table 3 shows that for large n our
algorithm again performs better due to its smaller communication
overhead, and due to load imbalance in the merge step of
hyperquicksort. For example, for the case of d = 3 and n = 64,000
(not shown in the table), one node finished the sort with 15,000
elements and one other with 1,000 elements in hyperquicksort, whereas
every node finished the sort with exactly 8,000 elements in our
algorithm.

Results show that the sorting algorithm presented in this paper
obtains very competitive speedups, and it has the advantage of
equally distributing data over the hypercube. The main weakness of
the algorithm is the relatively high partitioning overhead for small
n. However, in practice the criterion $c > |kn/p - G[k]|$ can be used
for terminating the iterations earlier, resulting in partitions of
size n/p ± 2c at worst.
Figure 1: Some steps of the Global Exchange on a 3-cube


Table 1: For the BALANCED distribution case, execution times of
Balanced-Partition Sort and Hyperquicksort (msec.)

d   n x 10^3    Hyp.   Total      Q      P      M     I
1        4        77     104     56     31     17    13
1        8       158     178    117     32     29    13
1       16       325     338    246     33     59    14
1       30       639    *639    495     37    107    15
1       60      1320   *1288   1034     38    216    16
2        4        51     105     27      -     20    13
2        8        98     142     55     54     33    12
2       16       196     236    118     59     59    13
2       30       374     409    234     68    107    15
2       60       771     778    495     74    209    16
2      100      1330   *1294    871     78    345    17
2      120      1583   *1524   1034     78    412    17
3        4        38     132     13     95     24    13
3        8        64     168     25    109     34    15
3       16       120     224     54    117     53    16
3       30       222     317    109    118     90    16
3       60       448     524    234    125    165    17
3      100       753     792    401    126    265    17
3      120       917     943    496    133    314    18
3      150      1152   *1150    627    133    390    18
3      200         -    1522    871    136    515    18
3      240         -    1791   1034    143    614    19
3      300         -    2252   1346    141    765    19

Table 2: For the WORST distribution case, execution times of
Balanced-Partition Sort and Hyperquicksort (msec.)

d   n x 10^3    Hyp.   Total      Q      P      M     I
1        4        78     119     56     54      9    23
1        8       159     188    118     55     15    23
1       16       327     335    246     60     29    25
1       30       645    *613    495     65     53    27
1       60      1315   *1207   1034     69    104    29
2        4        54     147     26     98     23    22
2        8       102     194     55     98     41    22
2       16       207     302    117    107     78    24
2       30       397     493    234    117    142    26
2       60       803     900    495    126    279    28
3        4        44     185     13    149     23    21
3        8        78     211     26    150     35    21
3       16       149     279     55    165     59    23
3       30       275     388    109    179    100    25
3       60       562     617    234    194    189    27
3      100       941    *916    400    209    307    29
3      120      1139   *1069    495    208    366    29
3      150      1431   *1308    627    226    455    31

Table 3: For the BEST distribution case, execution times of
Balanced-Partition Sort and Hyperquicksort (msec.)

d   n x 10^3    Hyp.   Total      Q      P      M     I
1        4        75     115     56     54      5    23
1        8       155     180    117     55      8    23
1       16       318     322    246     60     16    25
1       30       631    *589    496     64     29    27
1       60      1302   *1161   1033     70     58    29
2        4        55     131     26     97      8    22
2        8       107     163     55     97     11    22
2       16       216     243    118    106     19    24
2       30       414    *381    234    115     32    26
2       60       852    *682    495    125     62    28
2      100      1462   *1106    871    135    100    30
2      120      1723   *1288   1033    135    120    30
3        8        79     190     26    149     15    21
3       16       150     241     55    165     21    23
3       30       279     318    109    178     31    25
3       60       566    *480    233    194     53    27
3      100       952    *694    400    210     84    29
3      120      1155    *803    495    210     98    29
3      150      1449    *969    626    224    119    31
3      200         -    1251    871    224    156    31
3      240         -    1445   1034    224    187    31
3      300         -    1816   1346    240    230     -
Abstract

This paper describes a generalized version of a previously
published parallel sort algorithm, "parallel shell merge" [3]. This
version was implemented on the JPL/Caltech Mark III hypercube
concurrent processor. Each node starts out with a sublist of items
to be sorted first internally, and then among the other nodes.
Parallel shell merge is an algorithm used to produce a whole sorted
list across the hypercube once each sublist is sorted internally.
This version is general in the sense that it allows sublists of very
different sizes to be sorted as well as being able to handle
balanced sublists. This general version performs quite well when it
is used to sort balanced loads of data; there are some efficiency
losses due to the generalization, but they are acceptable.

I.  Background 

A parallel prototype of SEQGEN, the software that verifies and
expands high-level activities into low-level command sequences for
JPL flight projects, is in the process of development on the
JPL/Caltech Mark III hypercube concurrent processor [1][2]. The
purpose of this prototype is to show that the process of generating
commands and sending them to the spacecraft, generally called
"uplink", can be greatly sped up utilizing parallel computers. Since
SEQGEN spends a significant portion of its time sorting commands in
time order, a parallel sort algorithm is necessary for the best
possible parallel speedup.

II.  Parallel  Sort 

The parallel sort works as follows. Each node of the hypercube
starts out containing a sublist of commands in which there are time
fields to be sorted. Each sublist can be of any arbitrary size. The
algorithm to be described is a generalization of shell merge [3].
The algorithm's prerequisite is that the sublist on each node is
sorted; therefore, the sequential version of quick sort is applied
to each sublist independently using the median-of-three method to
select a pivot point where partition starts. Once each sublist is
sorted independently by quick sort as illustrated in Figure 1, shell
merge begins. These sublists lying across the hypercube make up the
complete list. Shell merge manipulates the resulting sublists and
sorts them in ascending order across the hypercube. In other words,
shell merge compares and exchanges the items of each pair of the
resulting sublists such that the node on the left would have the
lower-valued items while the higher-valued items migrate to the node
on the right. The node order is defined such that it reflects a map
of the hypercube topology onto a one-dimensional array, with the
lowest order nodes on the left. For example, in Figure 2 the case of
an 8-node cube is shown to have the following order: 0, 1, 3, 2, 6,
7, 5, 4. Thus, if node 0 and node 1 were to perform compare and
exchange as mentioned above, node 0 would be the node on the left
and node 1 the node on the right, as depicted in Figure 3.
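
The node order quoted for the 8-node cube coincides with the binary-reflected Gray code, in which consecutive array positions map to hypercube neighbors. The sketch below generates that order; the text does not name the mapping, so treating it as a Gray code is an assumption, and the function names `gray` and `node_order` are hypothetical.

```python
def gray(position):
    """Binary-reflected Gray code of an array position."""
    return position ^ (position >> 1)


def node_order(d):
    """Processor numbers of a d-cube listed by one-dimensional array index."""
    return [gray(i) for i in range(1 << d)]
```

For d = 3 this yields 0, 1, 3, 2, 6, 7, 5, 4, and any two consecutive entries differ in exactly one bit, so array neighbors are directly connected in the cube.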
Figure 4 illustrates the scheme used in shell merge to determine
which pairs of nodes are to do compare-exchanges with one another.
The first part of the algorithm accomplishes most of the sorting
effort in d (dimension of cube) steps. The algorithm makes large
jumps to get the items to their destinations, resulting in an
almost-sorted list. The remaining part of the algorithm is the
mop-up stage, where an algorithm resembling bubble sort is
implemented to complete the sort [3].


The generalization of the parallel shell merge algorithm described
in this paper handles the case where the sublists are of very
different sizes, e.g. node 0 has a list with 30 elements and node 1
has a list with 5 elements. Since we are using a synchronous,
"crystalline" environment, it is necessary to avoid deadlocks in
communication between two nodes, which arise when the node with the
shorter sublist is done and well on its way to other tasks while the
one with the longer sublist waits indefinitely for the other node to
talk back to it. If the data is arranged in such a way that,
regardless of the different list lengths, the node with the longer
sublist does not need to communicate with the other node after the
other node is done, then the algorithm works fine. Although cases
like this, where each node has a very different sublist size than
its neighbor node, do not make the best use of parallelism, an
algorithm should still be general enough to handle these cases,
especially to fulfill our particular application.
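
The effect of one compare-exchange between two nodes with sublists of different lengths can be sketched sequentially as below. This is an illustration only: it assumes each node keeps its original sublist length, which the text does not state explicitly, it merges both sublists in one place instead of exchanging buffered portions, and the function name `compare_exchange` is hypothetical.

```python
import heapq


def compare_exchange(left, right):
    """Merge-split two sorted sublists of possibly different lengths.

    Returns (new_left, new_right, moved), where new_left holds the
    len(left) lowest items, new_right the remaining highest items, and
    moved counts the items that travelled from the right node to the left.
    """
    # Tag every item with its origin: 0 for the left node, 1 for the right.
    tagged = list(heapq.merge(((x, 0) for x in left),
                              ((x, 1) for x in right)))
    low, high = tagged[:len(left)], tagged[len(left):]
    moved = sum(1 for _, origin in low if origin == 1)
    return [x for x, _ in low], [x for x, _ in high], moved
```

Under the length-preserving assumption, the number of items crossing in each direction never exceeds the length of the shorter sublist, which is why the exchange can stop once that many items have been read.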

Shell merge is able to sort sublists of very different sizes. This
is accomplished by detecting when the node with the shorter sublist
has just finished comparing and exchanging data with its partner
whose sublist is longer, and by stopping the node with the longer
sublist from asking for data from the node with the shorter sublist.
When the node with the shorter sublist has traversed all of its
elements, the sorting between the two nodes is completed; thus, no
further communications should be attempted by the node with the
longer sublist. To detect when the two nodes should stop
communicating with one another (when the node with the shorter
sublist is done), a flag called "end_of_list" is set. This flag is
set as soon as the number of items read from the other node is found
to be greater than or equal to the length of the smaller of the two
sublists. Since both nodes are always checking for the stopping
point, once either node's "end_of_list" flag is set, the
communication ends.

When shell merge is completed, node 0 is supposed to have the
sublist with the lowest items while the node at position (n-1) of
the array (node 4, in our 8-node example) should contain the sublist
with the highest items, since node 0 is the first node in the
one-dimensional array of processor numbers and the (n-1)st node is
the last node in the array (where n = number of nodes). During the
mop-up stage, if one allows node 0 and the (n-1)st node to
communicate, the higher node would have the low-valued sublist while
the lower node would have the high-valued sublist, and hence the
sort would never complete.

To avoid this, a nonperiodic boundary bookkeeping scheme was
developed: when node 0 is about to communicate with its neighbor to
the left, the channel mask of these two nodes must be determined
(where the channel mask is a decimal number obtained by converting
the binary number whose individual bits represent the communication
channels of the cube). However, notice that node 0 does not have a
neighbor to its left; node 0 is the leftmost node in the array.
Hence, when node 0 attempts to determine its channel mask with its
left neighbor, the channel mask is set equal to 0 to prevent it from
trying to communicate with the (n-1)st node. The same is done when
the (n-1)st node attempts to communicate with its neighbor to the
right, since the (n-1)st node is the rightmost node. If this special
case is not handled as described above, node 0 and the (n-1)st node
would communicate with one another, since the one-dimensional array
is treated as a circular array by the Crystalline Operating System
routine "gridchan", where node 0 is adjacent to the (n-1)st node.

Figure 5 illustrates how the one-dimensional array can be thought of
as circular. Note that in order for the (n-1)st node to be seen as
node 0's left neighbor or for node 0 to be seen as the (n-1)st
node's right neighbor, one must look at the circle in Figure 5
always facing the center of the circle while standing at the
position of each array index. In order for the list to be sorted in
ascending order from left to right along the array, the node on the
left must keep the lowest-valued items to itself while sending away
the highest-valued items to the node on its right. Therefore, if
node 0 and the (n-1)st node were to communicate with each other,
node 0 would treat the (n-1)st node as its left neighbor and would
try to keep the highest-valued items to itself while sending away
the lowest-valued items to the (n-1)st node. This obviously
contradicts the intended algorithm, and the same kind of
contradiction results when the (n-1)st node is to communicate with
node 0, treating node 0 as its right neighbor. Because of this
circular action, an infinite loop would result and the list would
never be sorted.


Figure 5: The numbers inside the slots are processor numbers;
the ones outside are array indices.


In order for a node on the left-hand side, e.g. node 3, to
have a sublist of lower items than those of its
right-hand-side neighbor, e.g. node 2 on a cube of 8
nodes, node 3 must send all of its highest items to node
2 while node 2 must send all of its lowest items to node
3. When sending items between two nodes for this
purpose, only portions of each sublist are sent at a time.
The bigger the portions that are sent at once, the less
time is wasted while a node sits idle waiting for the other
node to send some more items. In addition, the smaller
these portions are, the more communication calls are
required. However, if the whole sublist is sent all at
once, then there is the risk of unnecessary transfer of
data, depending on the nature of the actual data to be
sorted. Therefore, the size of the buffer is allowed to
vary from case to case depending on the size of the
smaller sublist of the two nodes communicating.
Empirically, 40% of the length of the smaller sublist is
a reasonable buffer size. This not only avoids hard
coding, but it also improves the speed, as shown below
in Figure 6 for a case with 1024 total items to be sorted.
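
The buffered exchange between two neighboring nodes can be sketched as follows. This is a minimal single-process simulation and not the authors' code: the function name `exchange_until_ordered`, the `fraction` parameter, and the choice to merge the two buffers and split the result are assumptions made for illustration. Only the 40% buffer size is taken from the text.

```python
def exchange_until_ordered(left, right, fraction=0.4):
    """Simulate the buffered exchange between a left node and its right neighbor.

    On return every item in the left sublist is <= every item in the right
    sublist, and both sublists keep their original lengths.
    """
    left, right = sorted(left), sorted(right)
    rounds = 0
    while left and right and left[-1] > right[0]:
        # Buffer size: a fraction of the smaller sublist, but at least one item.
        buf = max(1, int(fraction * min(len(left), len(right))))
        high = left[-buf:]   # the left node's highest items
        low = right[:buf]    # the right node's lowest items
        # Each node would receive the other's buffer; merging both buffers and
        # splitting them keeps the lower half on the left, the upper half on the right.
        pool = sorted(high + low)
        left = sorted(left[:-buf] + pool[:buf])
        right = sorted(pool[buf:] + right[buf:])
        rounds += 1
    return left, right, rounds
```

The returned `rounds` counts the communication calls, so the trade-off described above (fewer, larger buffers versus possibly unnecessary transfers) can be explored by varying `fraction`.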
III.  Conclusions  and  Results

The parallel sort described above is a generalized version
that works regardless of imbalance in the sizes of the
sublists to be sorted. Although imbalanced sublists
would not make use of parallelism to its fullest
potential, an algorithm should still be general enough to
withstand the worst input. As shown in Figures 6, 7,
and 8, this general version performs quite well when it is
used to sort balanced loads of data (relatively constant
sizes of sublists). There are some efficiency losses due
to its generalization. It can be seen from Figures 6, 7,
and 8 that the cost of having a more general algorithm is
a loss in efficiency of about 10% to 50% relative to the
same-list-size algorithm, depending upon problem size.
Interestingly, though, note that the loss is not
monotonic in the number of nodes (the general algorithm
has a more linear fall-off).
Abstract

We  give  a  group  of  parallel  methods  for  solving  polynomial 
related  problems  and  their  implementations  on  a  distributed 
memory  multicomputer.  These  problems  are  1.  the  evalua¬ 
tion  of  polynomials,  2.  the  multiplication  of  polynomials,  3. 
the  division  of  polynomials,  and  4.  the  interpolation  of  poly¬ 
nomials.  Mathematical  analyses  are  given  for  exploiting  the 
parallelisms  of  these  operations.  The  related  parallel  meth¬ 
ods  supporting  the  solutions  of  these  polynomial  problems, 
such  as  FFT,  Toeplitz  linear  systems  and  others  are  also 
discussed.  We  present  some  experimental  results  of  these 
parallel  methods  on  the  Intel  hypercube. 

1  Introduction 

Polynomials are one of the most useful and well known
classes of functions in various applications. A polynomial
function is defined as

$$P_n(x) = a_0 + a_1 x + \dots + a_n x^n \tag{1}$$

where $n$ is a nonnegative integer and $a_0, \dots, a_n$ are real
constants. When the degree $n$ of a polynomial function is
very large, computations for polynomial problems such as
interpolation, multiplication and division of polynomials
become computationally intensive. In addition,
polynomials represent an important class of expressions in
algebraic manipulation in symbolic computation. The basic
polynomial arithmetic operations, such as multiplications
and divisions, consume large execution times in computer
algebra processing (see e.g. Ponder [1988], Siebert-Roch and
Muller [1989]). Thus, such computation problems are good
candidates for parallel computers, both numerically and
symbolically.

We give a group of parallel methods for solving the
polynomial related problems, and their implementations on a
distributed memory multicomputer. These problems are:
1. the evaluation of polynomials, 2. the multiplication of
polynomials, 3. the division of polynomials, and 4. the
interpolation of polynomials. A parallel evaluation method for
polynomials based on Horner's rule is discussed in section 2.
The experimental results on the Intel hypercube are
also presented. The parallelism of the polynomial
multiplication is exploited by transferring the problem to a set of
special FFT series functions, on which the operations can
be perfectly distributed among different processors.
Section 3 gives the mathematical analyses and parallel method
of the polynomial multiplication. The polynomial division
problem is solved based on parallel solutions for Toeplitz
triangular linear systems and the parallel polynomial
multiplication, and is discussed in section 4. Section 5 addresses
a parallel method for the Lagrange piecewise cubic polynomial
interpolation. Finally, we give a summary and future
work in the last section.

*This author is supported in part by the University of Texas at
San Antonio Faculty Research Award.

2  Parallel  evaluation  of  polynomials 

The evaluation of a polynomial function is a basic operation
in polynomial problems:

$$P_n(x_0) = a_0 + a_1 x_0 + \dots + a_n x_0^n \tag{2}$$

where $a_0, a_1, \dots, a_n$ and the indeterminate $x_0$ are the
input variables, and $P_n(x_0)$ is the output solution variable. A
straightforward parallel method for (2) is to partition the
evaluation operations into a binary tree structure so that
the suboperations can be distributed among the processors
(see e.g. Siva and Murthy [1989]). The drawbacks of this
method are that the number of multiplications required is
not minimized, and a large amount of communication is
involved among different layers of the tree processors in the
process of evaluation.

Horner's rule, which evaluates a polynomial by the scheme

$$P_n(x_0) = (\dots((a_n x_0 + a_{n-1})x_0 + a_{n-2})x_0 + \dots + a_1)x_0 + a_0$$

requires exactly $n$ multiplications, and is the optimal
method to evaluate a polynomial in terms of minimizing the
operations. Horner's method may be easily described
as a sequential recurrence:

$$\begin{cases} b_n = a_n \\ b_{i-1} = b_i x_0 + a_{i-1}, & i = n, n-1, \dots, 1 \end{cases} \tag{3}$$
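
As an illustration, the recurrence (3) can be written in a few lines. This is a minimal sketch; the function name `horner` and the convention that the coefficient list is ordered from $a_0$ to $a_n$ are assumptions made for this example.

```python
def horner(coeffs, x0):
    """Evaluate a_0 + a_1*x0 + ... + a_n*x0**n, with coeffs = [a_0, ..., a_n]."""
    b = coeffs[-1]                    # b_n = a_n
    for a in reversed(coeffs[:-1]):   # b_{i-1} = b_i * x0 + a_{i-1}
        b = b * x0 + a
    return b                          # b_0 = P_n(x0)
```

The loop body runs once per coefficient below $a_n$, so it performs exactly $n$ multiplications.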
Based on (3), we develop a parallel Horner's method to
evaluate a polynomial. Let $a_{1-j} = 0$ for $j = 2, \dots, p$, where $p$
is the number of processors used to evaluate a polynomial,
and let $b_{n-j+1} = a_{n-j+1}$ on processor $j$.

Do j = 1 to p in parallel
begin
    for i = n - (p + j - 1) to p - j + 1 step -p
        b_i = b_{i+p} x_0^p + a_i
    b_{1-j} = b_{p-j+1} x_0^{p-j+1} + a_{1-j}
end

$$P_n(x_0) = \sum_{j=1}^{p} b_{1-j}$$
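
The idea of running $p$ independent Horner recurrences with stride $p$ can be sketched as below. This is a sequential simulation under stated assumptions, not the authors' hypercube code: the function names are hypothetical, and the coefficients are grouped by the residue class $r = i \bmod p$ rather than by the index arithmetic used in the text, which also lets the sketch work when $n \bmod p \neq 0$.

```python
def horner(coeffs, y):
    """Horner's rule for coeffs = [c_0, c_1, ...] at the point y."""
    b = 0
    for c in reversed(coeffs):
        b = b * y + c
    return b


def strided_horner(coeffs, x0, p):
    """Evaluate a polynomial with p independent stride-p Horner recurrences."""
    xp = x0 ** p
    partial = []
    for r in range(p):            # each r could run on its own processor
        sub = coeffs[r::p]        # a_r, a_{r+p}, a_{r+2p}, ...
        # The sub-polynomial is evaluated in x0**p, then shifted by x0**r.
        partial.append(horner(sub, xp) * x0 ** r)
    return sum(partial)           # final collection step over the p results
```

Only the last line requires communication in a distributed setting; everything inside the loop over `r` is independent.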

The evaluation operations distributed among the $p$ processors
are independent except for the last step, which collects and adds all
sub-solutions among the $p$ processors. Thus the speedup can
be determined by

$$S_p = \frac{n}{n/p + p - 1 + t_c} \tag{4}$$

where $n$ is the number of operations (in time units) for an $n$th
degree polynomial to be evaluated on a sequential machine,
and $n/p + p - 1 + t_c$ is the number of operations (in time units)
plus the communication time $t_c$ to collect the subsolutions
among the $p$ processors for evaluating the same polynomial
on a multicomputer with $p$ processors. The communication
time $t_c$ is small compared with the other operations, since the
operation to add all subsolutions among the $p$ processors
can be done through a global tree: the subsolutions are
accumulated from the leaf level, then sent to the host (see
Moler [1986]). This method reduces the number of communication
steps to $\log_2(p)$ instead of $p$ in a sequential message transfer.
Thus, when $n/p \gg p - 1 + t_c$, we may obtain a nearly linear
speedup $S_p \approx p$. However, in the worst case, when $p = n$,
the method becomes a sequential one and the speedup $S_p < 1$. In
addition, this parallel algorithm only applies to the situation
when $n \bmod p = 0$.
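
The speedup model (4) is easy to tabulate. The helper below is a small sketch; the function name `modeled_speedup` and the default $t_c = 0$ are assumptions for illustration, and the values it produces are model predictions, not measurements.

```python
def modeled_speedup(n, p, t_c=0.0):
    """Speedup predicted by equation (4): n / (n/p + p - 1 + t_c)."""
    return n / (n / p + p - 1 + t_c)
```

For $n = 128$ and $t_c = 0$ the model gives $128/65 \approx 1.97$ for $p = 2$ and $128/23 \approx 5.57$ for $p = 8$; for $p = n$ and any positive $t_c$ the value drops below 1, in line with the worst case noted above.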


The parallel Horner's method has been implemented on an
Intel hypercube multicomputer. Initially, the following
coefficients and variables are distributed to each processor:

n: the degree of the polynomial;
p: the number of processors used;
j: the processor index;
x_0: the indeterminate variable;
a_i, i = n - (p + j - 1), ..., p - j + 1: the coefficients for the jth
processor;
and coefficient a_{n-j+1}.

A polynomial of degree n = 128 is evaluated on different numbers of
processors on the hypercube. The speedups for the different
numbers of processors are plotted in Figure 1.

[Figure 1: Performance of the parallel Horner's method on
hypercube (speedup versus number of processors P).]

3  Parallel method for the multiplication of polynomials

Let $F(x)$ and $G(x)$ be two polynomial functions of degree $n-1$,

$$F(x) = a_0 + a_1 x + \dots + a_{n-1} x^{n-1} \tag{5}$$

and

$$G(x) = b_0 + b_1 x + \dots + b_{n-1} x^{n-1} \tag{6}$$

The multiplication problem is then defined as

$$F(x) \times G(x) = d_0 + d_1 x + \dots + d_{2n-2} x^{2n-2} \tag{7}$$

where

$$d_k = \sum_{i+j=k} a_i b_j, \qquad k = 0, 1, \dots, 2n-2.$$

For polynomials $p_1$ and $p_2$, the following relations describe how
polynomials change under the operation of multiplication
(see e.g. Fateman [1974]):

$$\mathrm{degree}(p_1 \times p_2) = \mathrm{degree}(p_1) + \mathrm{degree}(p_2)$$

$$\mathrm{size}(p_1 \times p_2) \le \mathrm{size}(p_1) \times \mathrm{size}(p_2).$$

The coefficients in polynomials (5) and (6) may be defined
as the following two sequences respectively:

$$A(k) = \begin{cases} a_k & 0 \le k \le n-1 \\ 0 & n-1 < k \le 2n-1 \end{cases}$$

$$B(k) = \begin{cases} b_k & 0 \le k \le n-1 \\ 0 & n-1 < k \le 2n-1 \end{cases}$$
where the functions $A(k)$ and $B(k)$ are periodic with period $2n$,
such that

$$A(k + 2nl) = A(k)$$

and

$$B(k + 2nl) = B(k)$$

where $k = 0, 1, \dots, 2n-1$, and $l$ is an integer. Then

$$D(i) = \sum_{k=0}^{2n-1} A(k) B(i-k), \qquad i = 0, \dots, 2n-1 \tag{8}$$

The coefficients $d_i$, $i = 0, \dots, 2n-2$, of the multiplication
in (7) can be determined from (8):

$$d_i = D(i), \qquad i = 0, 1, \dots, 2n-2. \tag{9}$$

Thus, the major work for the multiplication is to compute
the periodic function $D(i)$ for $i = 0, \dots, 2n-1$ in (8).
We decompose the computation of $D(i)$ into the following steps
with the aid of the discrete Fourier transform series:

$$\text{(i)} \quad x(j) = \sum_{k=0}^{2n-1} A(k)\, w^{jk}, \qquad j = 0, \dots, 2n-1$$

$$\text{(ii)} \quad y(j) = \sum_{k=0}^{2n-1} B(k)\, w^{jk}, \qquad j = 0, \dots, 2n-1$$

$$\text{(iii)} \quad z(j) = x(j)\, y(j), \qquad j = 0, \dots, 2n-1$$

and finally,

$$\text{(iv)} \quad D(k) = \frac{1}{2n} \sum_{j=0}^{2n-1} z(j)\, w^{-jk}, \qquad k = 0, \dots, 2n-1$$

The operations in (iii) can be done perfectly in parallel.
Notice that the functions in (i), (ii) and (iv) are similar
to the ones in the standard discrete Fourier transform and its
inverse (see e.g. Aho, Hopcroft and Ullman [1974]):

$$f(j) = \sum_{k=0}^{n-1} x(k)\, w^{jk}, \qquad j = 0, \dots, n-1 \tag{10}$$

and

$$x(k) = \frac{1}{n} \sum_{j=0}^{n-1} f(j)\, w^{-jk}, \qquad k = 0, \dots, n-1 \tag{11}$$

where $w$ is an $n$th primitive root of 1, and $f(j)$ is a real
function for $j = 0, \dots, n-1$. Thus, the transforms defined
for computing polynomial multiplication are special cases of
the standard discrete Fourier transform.

The major operations in (i), (ii), and (iv) can be easily
written in the form of a matrix vector multiplication:

$$t = Wq \tag{12}$$

where $t = (t_0, t_1, \dots, t_{2n-2}, t_{2n-1})^T$,

$$W = \begin{pmatrix}
w_{2n}^{0} & w_{2n}^{0} & w_{2n}^{0} & \cdots & w_{2n}^{0} \\
w_{2n}^{0} & w_{2n}^{1} & w_{2n}^{2} & \cdots & w_{2n}^{2n-1} \\
w_{2n}^{0} & w_{2n}^{2} & w_{2n}^{4} & \cdots & w_{2n}^{2(2n-1)} \\
\vdots & \vdots & \vdots & & \vdots \\
w_{2n}^{0} & w_{2n}^{2n-1} & w_{2n}^{2(2n-1)} & \cdots & w_{2n}^{(2n-1)(2n-1)}
\end{pmatrix}$$

and $q = (q_0, \dots, q_{n-1}, 0, \dots, 0)^T$. The $t$ vector represents
the $x$, $y$ or $D$ vector in (i), (ii) or (iv). The $q$ vector
represents the two sequences $A$ or $B$ of period $2n$ in (8). The
complexity of the multiplication in (12) is $O(4n^2)$.

The value $w_{2n}$ is a complex number, $w_{2n} = e^{2\pi i/2n}$, where
$i = \sqrt{-1}$, which is also called a $2n$th primitive root of 1. In
other words, $w_{2n}^{2n} = 1$, and every other $2n$th root of 1 can be
represented as some power of $w_{2n}$.

The Fast Fourier Transform (FFT) developed by Cooley
and Tukey [1965] has been widely used for computing the
standard discrete Fourier transform of (10) and (11). The
complexity of the FFT reduces to $O(n \log_2 n)$. This improvement
comes from a reordering of data by taking advantage
of the fact

$$w^{jk} = w^{jk \bmod n}$$

for $j = 0, \dots, n-1$ and $k = 0, \dots, n-1$. The FFT
computation structure can also be easily processed in parallel
with some control of synchronization and communication.
The implementation and experiments of parallel
FFT methods have been done on both distributed memory
and shared memory multiprocessors (see e.g. Chamberlain
[1986], Chan[1966], and Norton and Silberger[1968]).

We apply the FFT to our special series functions in (i), (ii)
and (iv) with slight modifications. Since the functions are
periodic with period $2n$, and the $q$ vector in the multiplication
form (12) is only half full, the complexity to compute (i),
(ii) and (iv) reduces to $O(n \log_2 n - n)$.
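
Steps (i) to (iv) can be checked against the direct sum for $d_k$ with a short sequential sketch. Everything here is illustrative rather than the authors' parallel implementation: the function names are hypothetical, the recursive radix-2 FFT assumes that $n$ is a power of two, and the result is scaled by $1/2n$ in the inverse step.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    m = len(values)
    if m == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = -1 if invert else 1
    out = [0j] * m
    for j in range(m // 2):
        w = cmath.exp(sign * 2j * cmath.pi * j / m)  # power of an m-th root of 1
        out[j] = even[j] + w * odd[j]
        out[j + m // 2] = even[j] - w * odd[j]
    return out


def multiply_direct(a, b):
    """Coefficients d_k = sum over i + j = k of a_i * b_j."""
    d = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            d[i + j] += ai * bj
    return d


def multiply_fft(a, b):
    """Multiply two polynomials with n coefficients each (n a power of two)."""
    n = len(a)
    A = list(a) + [0] * n                    # A(k), zero-padded to period 2n
    B = list(b) + [0] * n                    # B(k), zero-padded to period 2n
    x = fft(A)                               # step (i)
    y = fft(B)                               # step (ii)
    z = [xj * yj for xj, yj in zip(x, y)]    # step (iii), independent for each j
    D = fft(z, invert=True)                  # step (iv), before the 1/(2n) scaling
    return [(v / (2 * n)).real for v in D[: 2 * n - 1]]
```

Step (iii) is the only step with no data dependence between indices, which is why it parallelizes without communication.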


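The parallel method detailed below fulfills the requirement
that every processor is involved immediately in the
computation, and no arithmetic operation on the same data
sequence terms is ever duplicated on any pair of processors.
In terms of communication efficiency, the best arrangement of
the processors on a distributed memory multicomputer for this
FFT-type processing is a hypercube topology, since exchanges
of values are only required between neighbor processors.

Consider a multiplication of two 3rd degree polynomials,
i.e. $n = 4$ in the polynomials defined in (5) and (6).
Substituting $n = 4$ into the matrix vector multiplication

$$t = Wq,$$

we have $t = (t_0, t_1, \dots, t_7)^T$,

$$W = \begin{pmatrix}
w_8^{0} & w_8^{0} & w_8^{0} & \cdots & w_8^{0} \\
w_8^{0} & w_8^{1} & w_8^{2} & \cdots & w_8^{7} \\
\vdots & \vdots & \vdots & & \vdots \\
w_8^{0} & w_8^{7} & w_8^{14} & \cdots & w_8^{49}
\end{pmatrix}$$

and $q = (q_0, q_1, q_2, q_3, 0, 0, 0, 0)^T$. Using the property of
period $2n$,

$$w_{2n}^{jk} = w_{2n}^{jk \bmod 2n}$$

for $j = 0, \dots, 2n-1$ and $k = 0, \dots, 2n-1$, the $W$ matrix
becomes

$$W = \begin{pmatrix}
w_8^0 & w_8^0 & w_8^0 & w_8^0 & w_8^0 & w_8^0 & w_8^0 & w_8^0 \\
w_8^0 & w_8^1 & w_8^2 & w_8^3 & w_8^4 & w_8^5 & w_8^6 & w_8^7 \\
w_8^0 & w_8^2 & w_8^4 & w_8^6 & w_8^0 & w_8^2 & w_8^4 & w_8^6 \\
w_8^0 & w_8^3 & w_8^6 & w_8^1 & w_8^4 & w_8^7 & w_8^2 & w_8^5 \\
w_8^0 & w_8^4 & w_8^0 & w_8^4 & w_8^0 & w_8^4 & w_8^0 & w_8^4 \\
w_8^0 & w_8^5 & w_8^2 & w_8^7 & w_8^4 & w_8^1 & w_8^6 & w_8^3 \\
w_8^0 & w_8^6 & w_8^4 & w_8^2 & w_8^0 & w_8^6 & w_8^4 & w_8^2 \\
w_8^0 & w_8^7 & w_8^6 & w_8^5 & w_8^4 & w_8^3 & w_8^2 & w_8^1
\end{pmatrix}$$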
The matrix vector multiplication for $n = 4$ may be decomposed
into its parallel form:

$$t = W_4 W_2 q'$$

where $W_4$ is a 4 block diagonal matrix and $W_2$ is a 2 block
diagonal matrix, both with entries that are zeros and powers of $w_8$
[the individual matrix entries are not legible in the source],
and $q' = (q_0, q_1, q_2, q_3, q_0, q_1, q_2, q_3)^T$. In general, the matrix
vector multiplication of order $2n$ in (12) may be decomposed
into the form

$$t = W_n \cdots W_4 W_2 q' \tag{13}$$

where $W_i$ is the $i$ block diagonal matrix, for $i = 2, 2^2, 2^3, \dots, n$,
and $q' = (q_0, \dots, q_{n-1}, q_0, \dots, q_{n-1})^T$.

The formula (13) can be performed perfectly in parallel
on a $p$ processor multicomputer as long as each processor
keeps a copy of the variables $q_0, \dots, q_{n-1}$, and the matrices $W_i$ for
$i = 2, 2^2, \dots, n$ are distributed by rows to the $p$ processors.
Figure 2 gives the operation dependency graph in each
processor for the example of $n = 4$ on a 4 processor
multicomputer.


Figure 2: Operation dependency graph for a polynomial
multiplication of order n = 4 on a 4 processor multicomputer.


4  Parallel  method  for  the  division  of 
polynomials 

Given a polynomial $A(x)$ of degree $2n$ and a polynomial $B(x)$
of degree $n$,

$$A(x) = a_0 + a_1 x + \dots + a_{2n} x^{2n}$$

$$B(x) = b_0 + b_1 x + \dots + b_n x^n$$

the division problem is then defined as:

$$A(x) = B(x) Q(x) + R(x) \tag{14}$$

where $Q(x)$ is the quotient polynomial of $A(x) \div B(x)$, and
$R(x)$ is the remainder polynomial.

We can show that the coefficients of the polynomial $Q(x)$
with variables $(x^n, x^{n-1}, \dots, x^0)$ are identical to the
coefficients of the polynomial $D(x) \times K(x)$ for the variable set
$(x^{2n}, x^{2n-1}, \dots, x^n)$, where $D(x)$ is the quotient polynomial
for

$$x^{2n} = B(x) D(x) + S(x) \tag{15}$$

and $K(x)$ is the polynomial formulated from $A(x)$,

$$K(x) = a_n + a_{n+1} x + \dots + a_{2n-1} x^{n-1} + a_{2n} x^n$$

(see e.g. You [1983]). Thus, the major task in the division of
polynomials is to compute the coefficients of the polynomial
$D(x)$.

In (15), $D(x)$ is defined as

$$D(x) = d_0 + d_1 x + \dots + d_{n-1} x^{n-1} + d_n x^n. \tag{16}$$


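Substituting (16) into (15) and comparing the coefficients on both
sides of the polynomial, we get

$$BD = I \tag{17}$$

where

$$B = \begin{pmatrix}
b_n & b_{n-1} & \cdots & b_0 \\
 & b_n & \ddots & \vdots \\
 & & \ddots & b_{n-1} \\
 & & & b_n
\end{pmatrix}$$

is an upper triangular Toeplitz matrix formed from the given
coefficients of polynomial $B(x)$, and

$$D = \begin{pmatrix}
d_n & d_{n-1} & \cdots & d_0 \\
 & d_n & \ddots & \vdots \\
 & & \ddots & d_{n-1} \\
 & & & d_n
\end{pmatrix}$$

is another upper triangular Toeplitz matrix formed from
the unknown coefficients of polynomial $D(x)$, and $I$ is
the identity matrix. From (17), the coefficients $d_i$ for
$i = 0, 1, \dots, n$ may be determined by the following triangular
Toeplitz linear system of equations:

$$\begin{pmatrix}
b_n & b_{n-1} & \cdots & b_0 \\
 & b_n & \ddots & \vdots \\
 & & \ddots & b_{n-1} \\
 & & & b_n
\end{pmatrix}
\begin{pmatrix} d_0 \\ \vdots \\ d_{n-1} \\ d_n \end{pmatrix}
=
\begin{pmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{pmatrix} \tag{18}$$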

Several parallel algorithms have been developed for solving
full and upper triangular Toeplitz linear systems
of equations (see e.g. Bini [1984], Sai, Li and Xie [1986]).
These algorithms are based on recurrence methods for
solving the Toeplitz linear systems. Parallelism is exploited
in the vector processing in each step of the recurrence.
Since the computation of each step in the recurrence
depends on the results of the previous step, and the sizes
of the vectors increase as the recurrence steps increase,
those algorithms would not gain good
speedups on a distributed memory multicomputer. Li and
Coleman [1988] [1989] develop a group of parallel methods
for solving general triangular linear systems on distributed
memory multicomputers. Based on the idea of Li-Coleman's
methods, and the Toeplitz triangular system solution
dependency graph in Figure 3, we give a parallel method for
solving the Toeplitz triangular linear system (18). Assume
the Toeplitz triangular matrix is distributed in a wrap fashion
among the processors: column $j$ is assigned to processor
$(j - 1) \bmod p$, where $j = 0, 1, \dots, n$ and $p$ is the number of
processors used. Thus the column distribution function is
defined as

$$P(j) = (j - 1) \bmod p$$

The initial conditions set the following two vector variables:
Sum(), the partial sum results for the solution, and
Psum(), the partial sum results computed in parallel:

if nodenum = P(n) then
begin
    Sum(1) = 1;
    Sum(i) = 0 for i = 2, ..., p;
    Psum(i) = 0 for i = 1, ..., n - p + 1;
end
else
begin
    Sum(i) = 0 for i = 1, ..., p;
    Psum(i) = 0 for i = 1, ..., n - p + 1
end.


Figure 3: Toeplitz triangular system solution dependency
graph


The program on each processor:

for j = n downto 1 do
begin
    if nodenum = P(j) then
    begin
        receive Sum(i) for i = 1, ..., p - 1 and j < n;
        d_j = (Sum(1) + Psum(j))/b_n;
        Sum(i) = Sum(i + 1) - a_{n-i} d_j + Psum(j - i) for i = 1, ..., p - 2;
        Sum(p - 1) = a_{n-p+1} d_j + Psum(j - p + 1);
        send Sum(i) to node P(j - 1) for i = 1, ..., p - 1 and j > 1;
        Psum(i) = a_{n-i+1} for i = 1, ..., n - p
    end
end.


The coefficients of the quotient polynomial $Q(x)$ are determined
by $K(x) \times D(x)$, to which the parallel multiplication described
in the last section is applied. Then the remainder polynomial
$R(x)$ is determined by

$$R(x) = A(x) - B(x) Q(x),$$

where $B(x) \times Q(x)$ uses the parallel multiplication method
again.
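
The whole division procedure can be followed on a small sequential example. The sketch below is an assumption-laden illustration, not the parallel method: the function names are hypothetical, the triangular Toeplitz system (18) is solved by plain back substitution, a direct coefficient product stands in for the parallel multiplication, and exact rational arithmetic from the standard library is used to avoid rounding.

```python
from fractions import Fraction


def multiply(a, b):
    """Direct coefficient product; stands in for the parallel multiplication."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def reciprocal_quotient(b):
    """Solve system (18): coefficients d_0..d_n of the quotient of x^(2n) by B(x)."""
    n = len(b) - 1
    d = [Fraction(0)] * (n + 1)
    d[n] = Fraction(1) / b[n]
    for j in range(n - 1, -1, -1):
        # The coefficient of x^(n+j) in B(x)D(x) must vanish.
        s = sum(b[n + j - k] * d[k] for k in range(j + 1, n + 1))
        d[j] = -s / b[n]
    return d


def divide(a, b):
    """Return (Q, R) with A = B*Q + R, for deg A = 2n and deg B = n."""
    n = len(b) - 1
    assert len(a) == 2 * n + 1
    d = reciprocal_quotient(b)
    k = a[n:]                    # K(x) = a_n + a_{n+1} x + ... + a_{2n} x^n
    q = multiply(k, d)[n:]       # coefficients of x^n ... x^(2n) in K(x)D(x)
    bq = multiply(b, q)
    r = [ai - ci for ai, ci in zip(a, bq)][:n]
    return q, r
```

For example, with $B(x) = 1 + 2x + x^2$ and $A(x) = -5 + 7x + 3x^2 + 5x^3 + 3x^4$, `divide([-5, 7, 3, 5, 3], [1, 2, 1])` yields $Q(x) = 2 - x + 3x^2$ and $R(x) = -7 + 4x$.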


5  Parallel method for Lagrange piecewise cubic interpolation

Given a set of $n + 1$ pairs of real values $(x_i, y_i)$ for $i =
0, 1, \dots, n$ with distinct $x_i$, there exists a unique polynomial
$P_n(x)$ of degree at most $n$ such that $P_n(x_i) = y_i$ for $i = 0, 1, \dots, n$.
This interpolating polynomial $P_n(x)$ can be written in the
Lagrangian form

$$P_n(x) = \sum_{i=0}^{n} y_i\, l_i(x) \tag{19}$$

where

$$l_i(x) = \prod_{k=0,\, k \ne i}^{n} \frac{x - x_k}{x_i - x_k}$$

for $i = 0, 1, \dots, n$.
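
A minimal sketch of evaluating the Lagrange form (19) follows. It factors each term into a weight that does not depend on the evaluation point and a product that does, and it splits the index range into blocks to mimic a distribution over $p$ processors. The function names, the block partitioning by ceiling division, and the sequential loop over blocks are assumptions made for this illustration.

```python
def lagrange_weights(xs, ys):
    """Weights y_i / prod_{k != i} (x_i - x_k); independent of the evaluation point."""
    weights = []
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        denom = 1
        for k, xk in enumerate(xs):
            if k != i:
                denom *= xi - xk
        weights.append(yi / denom)
    return weights


def lagrange_eval(xs, weights, x0, p=1):
    """Evaluate P_n(x0) as a sum of p partial sums, one per block of indices."""
    m = len(xs)
    block = -(-m // p)   # ceil(m / p) indices per simulated processor
    partial_sums = []
    for j in range(p):
        s = 0
        for i in range(j * block, min((j + 1) * block, m)):
            h = 1
            for k, xk in enumerate(xs):
                if k != i:
                    h *= x0 - xk
            s += h * weights[i]
        partial_sums.append(s)
    return sum(partial_sums)   # the only step that needs communication
```

Because the weights do not depend on `x0`, they can be computed once and reused for every evaluation point.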


In many practical problems, Lagrange interpolating
polynomials may not be suitable for use as an approximation.
This is because polynomials of high degree often have a very
oscillatory behavior, which is not desirable in approximating
functions that are reasonably smooth. One alternative is
to find a polynomial of low degree that "best
fits" the data; piecewise polynomial interpolation
is an attractive choice. In piecewise polynomial interpolation,
several lower-degree polynomials are joined together in a
continuous fashion so that the resulting piecewise
polynomial interpolates the data. One of the most commonly used
piecewise interpolation methods is the Lagrange piecewise
cubic interpolation.


The Lagrange form (19) may be decomposed as follows:

$$P_n(x) = \sum_{i=0}^{n} C_i \prod_{k=0,\, k \ne i}^{n} (x - x_k) \tag{20}$$

where

$$C_i = \frac{y_i}{\prod_{k=0,\, k \ne i}^{n} (x_i - x_k)}$$

for $i = 0, 1, \dots, n$. The coefficients $C_i$ can be calculated
perfectly in parallel on a distributed memory multicomputer.
The parallel algorithm on p processors has the following form:

Do j = 1 to p in parallel
begin
    for i = (j - 1)[n/p] to j[n/p] - 1 do
    begin
        h = 1;
        for k = 0 to n and (k ≠ i)
            h = h/(x_i - x_k);
        C_i = h y(x_i)
    end
end.

The evaluation of the polynomial at a point x_0 based on the
decomposed Lagrange form (20) can be done in parallel for most
of the calculation, except for the sum operation at the end.

Do j = 1 to p in parallel
begin
    S_j = 0;
    for i = (j - 1)[n/p] to j[n/p] - 1 do
    begin
        h = 1;
        for k = 0 to n and (k ≠ i)
            h = h(x_0 - x_k);
        S_j = S_j + h C_i
    end
end

$$P_n(x_0) = \sum_{j=1}^{p} S_j$$

Let $y = f(x)$ be a continuous function in $[a, b]$. The
Lagrange piecewise cubic function interpolates $f(x)$ on the
nodes $x_j = a + jh$ for $j = 0, 1, \dots, n$ and $h = (b - a)/n$.
Thus, we obtain $n - 2$ cubic polynomials

$$P_i(x) = a_i x^3 + b_i x^2 + c_i x + d_i$$

for $i = 1, \dots, n - 2$. Substitute $(x_{i-1}, y_{i-1})$, $(x_i, y_i)$,
$(x_{i+1}, y_{i+1})$, and $(x_{i+2}, y_{i+2})$ into the Lagrange polynomial
form, and define


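$$A_{11} = \frac{1}{6h^3}(x_i + x_{i+1} + x_{i+2}),$$

$$A_{12} = -\frac{1}{2h^3}(x_{i-1} + x_{i+1} + x_{i+2}),$$

$$A_{13} = \frac{1}{2h^3}(x_{i-1} + x_i + x_{i+2}),$$

$$A_{14} = -\frac{1}{6h^3}(x_{i-1} + x_i + x_{i+1}),$$

$$A_{21} = -\frac{1}{6h^3}(x_i x_{i+1} + x_i x_{i+2} + x_{i+1} x_{i+2}),$$

$$A_{22} = \frac{1}{2h^3}(x_{i-1} x_{i+1} + x_{i-1} x_{i+2} + x_{i+1} x_{i+2}),$$

$$A_{23} = -\frac{1}{2h^3}(x_{i-1} x_i + x_{i-1} x_{i+2} + x_i x_{i+2}),$$

$$A_{24} = \frac{1}{6h^3}(x_{i-1} x_i + x_{i-1} x_{i+1} + x_i x_{i+1}),$$

$$A_{31} = \frac{1}{6h^3}\, x_i x_{i+1} x_{i+2},$$

$$A_{32} = -\frac{1}{2h^3}\, x_{i-1} x_{i+1} x_{i+2},$$

$$A_{33} = \frac{1}{2h^3}\, x_{i-1} x_i x_{i+2},$$

$$A_{34} = -\frac{1}{6h^3}\, x_{i-1} x_i x_{i+1},$$

where $A_{jk}$ is defined for $k = 1, 2, 3, 4$ and $j = 1, 2, 3$. Then the
coefficients $a_i$, $b_i$, $c_i$ and $d_i$ for $i = 1, \dots, n - 2$ may be defined
in the following matrix multiplication form

$$A = BY \tag{21}$$

where

$$A = (a_1, a_2, \dots, a_{n-2},\; b_1, b_2, \dots, b_{n-2},\; c_1, c_2, \dots, c_{n-2},\; d_1, d_2, \dots, d_{n-2})^T,$$

$B$ is the matrix formed from the coefficients $A_{jk}$, and $Y$ is formed
from the data values [the explicit layouts of $B$ and $Y$ are not
legible in the source].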
The parallel processing for the matrix multiplication (21)
is straightforward. Matrix $Y$ is distributed among the $p$
processors in a multicomputer, and $p$ sets of columns of matrix $B$
are assigned, one to each processor. Then the coefficients
$a_i$, $b_i$, $c_i$ and $d_i$ for $i = 1, \dots, n - 2$ are computed in parallel
on the $p$ processors. The evaluation of each cubic function may
also be done on the $p$ processors in parallel by Horner's rule

$$P_i(x) = ((a_i x + b_i)x + c_i)x + d_i$$

for $i = 1, \dots, n - 2$.
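
One cubic piece can be computed and evaluated as in the sketch below. It is an illustration under stated assumptions: the function names are hypothetical, the coefficients are obtained by expanding the four Lagrange basis cubics on equally spaced nodes rather than by the distributed matrix product (21), and each piece is independent of the others, so the pieces could be assigned to different processors.

```python
def cubic_coefficients(x_first, h, y):
    """Coefficients (a, b, c, d) of a*x^3 + b*x^2 + c*x + d through the four
    equally spaced nodes x_first, x_first + h, x_first + 2h, x_first + 3h,
    where y holds the four data values in the same order."""
    nodes = [x_first + k * h for k in range(4)]
    # Products (x_k - x_m) over m != k for equally spaced nodes.
    denoms = [-6 * h**3, 2 * h**3, -2 * h**3, 6 * h**3]
    a = b = c = d = 0
    for k in range(4):
        u, v, w = [nodes[m] for m in range(4) if m != k]
        s = y[k] / denoms[k]
        # Expand s * (x - u)(x - v)(x - w) and accumulate by power of x.
        a += s
        b -= s * (u + v + w)
        c += s * (u * v + u * w + v * w)
        d -= s * (u * v * w)
    return a, b, c, d


def horner_cubic(coeffs, x):
    """Evaluate ((a*x + b)*x + c)*x + d."""
    a, b, c, d = coeffs
    return ((a * x + b) * x + c) * x + d
```

Since four nodes determine a cubic uniquely, data sampled from a cubic are reproduced exactly, which gives a simple check of the coefficient formulas.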

6  Conclusions  and  future  work 

We have discussed parallel methods for polynomial
evaluation, polynomial multiplication, polynomial division
and Lagrange piecewise cubic interpolation, and presented
some of the experimental results on the Intel hypercube.
These methods may be easily implemented on a distributed
memory multicomputer. We expect the methods will gain
good speedups on a distributed memory multicomputer
with careful data distribution. The next step of the work
is to implement the rest of the methods on a local memory
multicomputer to test the performance. We also plan
to investigate other parallel methods for polynomial related
problems, such as multi-solutions of a polynomial, different
polynomial interpolation methods, cubic spline interpolation
and others.
Abstract

Continuation  methods  compute  paths  of  solutions  of  non¬ 
linear  equations  that  depend  on  a  parameter.  This  paper 
examines  some  aspects  of  the  multicomputer  implemen¬ 
tation  of  such  methods.  The  computation  is  done  on  the 
Symult  Series  2010  multicomputer. 

One of the main issues in the development of concurrent
programs is load balancing, achieved here by using
appropriate data distributions. In the continuation process, a
large number of linear systems have to be solved. For
nearby points along the solution path, the corresponding
system matrices are closely related to each other.
Therefore, pivots which are good for the LU-decomposition of
one matrix are likely to be acceptable for a whole segment
of the solution path. This suggests choosing certain data
distributions that achieve good load balancing. In
addition, if these distributions are used, the resulting code is
easily vectorized.

To test this technique, the invariant manifold of a
system of two identical nonlinear oscillators is computed as
a function of the coupling between them. This invariant
manifold is determined by the solution of a system of
nonlinear partial differential equations that depends on the
coupling parameter. A symmetry in the problem reduces
this system to one single equation, which is discretized by
finite differences. The solution of this discrete nonlinear
system is followed as the coupling parameter is changed.

1  Introduction 

Concurrent programming is difficult and needs to be
simplified. This simple statement describes a major goal of
research into concurrent computing. The focus on
simplification is justified, because the accumulated
experience of earlier feasibility studies is overwhelmingly
positive. These feasibility studies required machine-dependent
program and problem reformulation. To raise the
concurrent technology from the level of feasible to that of usable,
much of current research focuses on simplification of the
concurrent-programming task.

*This research is supported in part by Department of Energy
Grant No. DE-AS03-76ER72012. This material is based upon work
supported by the NSF under Cooperative Agreement No.
CCR-8909615. The government has certain rights in this material.

At  the  heart  of  most  efficient  concurrent  programs  is 
data  locality:  the  data  is  stored  in  memory  locations 
“near”  the  processor  using  the  data.  To  achieve  data  lo¬ 
cality,  a  data  distribution  must  be  introduced.  In  general, 
that  is  a  task  the  programmer  must  perform,  because  the 
best  data  distribution  is  determined  by  global  considera¬ 
tions  not  accessible  to  analysis  by  low-level  system  com¬ 
ponents  (hardware,  operating  system,  and  compiler). 

This is illustrated by the following simple example of
matrix-vector multiplication. Let $A$ be an $M \times N$ matrix,
and $x$ and $y$ vectors of dimension $N$ and $M$, respectively.
The assignment:

$$y := Ax \tag{1}$$

requires the evaluation of a matrix-vector product. If
this were a self-contained program, not part of a larger
program, the optimal data distribution and corresponding
optimal program would be easily derived. The rows of the
matrix $A$ should be distributed evenly (within divisibility
constraints) over all concurrent processes. The resulting
program is optimal, because it is perfectly load balanced
and it requires no communication. Similarly, for the
assignment:

$$x^T := y^T A \tag{2}$$

one should distribute the matrix columns. For a composite
program that evaluates both assignments (1) and (2),
neither distribution is optimal. The best distribution
distributes both rows and columns; moreover, the process
grid is a function of the ratio of the number of times (1)
versus (2) is evaluated. Only the user can have a
reasonable estimate of this last quantity; hence, only the user
can determine the best distribution. (We have ignored
the distribution of the vectors for ease of exposition; the
conclusion remains valid if one includes them.)
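
The two distributions can be illustrated with a small sequential simulation. The sketch is an assumption made for exposition, not code from the program discussed here: the helper names are hypothetical, the "processes" are simply iterations of a loop, and, as in the text, the distribution of the vectors is ignored.

```python
def index_blocks(m, p):
    """Split indices 0..m-1 into p nearly equal contiguous blocks."""
    base, extra = divmod(m, p)
    blocks, start = [], 0
    for r in range(p):
        size = base + (1 if r < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def matvec_row_distributed(A, x, p):
    """y := A x; each process computes the entries of y for its own rows only."""
    y = [0] * len(A)
    for rows in index_blocks(len(A), p):        # one iteration per process
        for i in rows:
            y[i] = sum(A[i][j] * x[j] for j in range(len(x)))
    return y


def vecmat_column_distributed(y, A, p):
    """x^T := y^T A; each process computes the entries of x for its own columns."""
    n = len(A[0])
    x = [0] * n
    for cols in index_blocks(n, p):             # one iteration per process
        for j in cols:
            x[j] = sum(y[i] * A[i][j] for i in range(len(A)))
    return x
```

With rows distributed, each entry of $y$ in (1) is produced entirely by one process; with columns distributed, the same holds for each entry of $x$ in (2). Using the row distribution for (2) instead would force every process to contribute a partial sum to every entry, which is where the communication cost of a mismatched distribution comes from.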

Supplying the data distribution is thus a user task.
Considering our goal to simplify concurrent programming,
supplying the data distribution should be the only
concurrency-related user task. Programming languages
that allow postponing decisions about data distribution
are under development; see, e.g., Chen [2]. The concept,
however, is independent of particular notations or
languages, and it can be evaluated within existing
concurrent computing systems (although some overheads are to
be expected as a result). The program discussed in the
remainder of this paper is a realistic illustration of such
an approach to concurrent computing, where the data
distribution is imposed only after the program was fully
developed. In spite of the restriction that all ideas had to
be implemented at the software level, instead of at the
language or compiler level, excellent performance was
obtained. To obtain the best performance, a dynamic data
distribution is introduced, which is periodically adapted
to achieve global load balance. In this respect, our
approach differs significantly from conventional parallelization
strategies that break up programs into small "manageable"
pieces, typically program loops, and consider
each as an independent entity.

An outline of the paper follows. The mathematical
aspects of continuation and its application to the
computation of invariant manifolds are discussed in section 2.
These aspects are covered only to the extent necessary for
understanding the algorithmic aspects of the program. For a
more detailed treatment, see [12]. In section 3, we discuss
the implementation of the program, and in section 4 its
performance.


2 Continuation and Invariant Manifolds

Consider a system of M equations:

$$G(u, \lambda) = 0 \qquad (3)$$

for $u \in \mathbb{R}^M$, which depends on a parameter
$\lambda \in \mathbb{R}$. Here,
$G : \mathbb{R}^M \times \mathbb{R} \to \mathbb{R}^M$
is a smooth map. By a solution branch we mean a
one-parameter family

$$(u(s), \lambda(s)) \in \mathbb{R}^M \times \mathbb{R}, \quad s_a \le s \le s_b \qquad (4)$$

of solutions of (3) depending smoothly on some parameter
$s \in [s_a, s_b]$. Because of their importance for applications,
many numerical methods have been devised and investigated
to compute such branches [6,8]. Assuming that the
branch (4) contains only regular points and folds, one has
to solve linear systems whose matrices have the form:

$$\begin{bmatrix} A & b \\ c^T & d \end{bmatrix} \in \mathbb{R}^{(M+1) \times (M+1)} \qquad (5)$$

where $A$ is $M \times M$. The matrix $A$ is singular at folds; the
bordered system, however, is well conditioned.

We use two concurrent solution methods for such bordered
systems. Our first method is a variant of Keller's
bordering algorithm [7] that takes into account the possible
singularity of the matrix A. The second method is a
variant of Goovaerts [5]. Here, we consider only the first
method, which begins by computing an LU-decomposition
of A. Because the matrix A may be singular, partial
pivoting is often not sufficient, and a more general pivoting
strategy must be used. For simplicity, the only dynamic
pivoting strategy considered here is complete pivoting, but
other dynamic strategies are easily substituted. Once the
LU-decomposition of A is known, the bordered system is
solved using slightly modified back-solves and the solution
of a 2 x 2 system.

Numerical and performance results are given for a
rather involved test problem, namely the numerical
calculation of the invariant manifold of a
parameter-dependent dynamical system:

$$\dot{v} = F(v, \lambda) \qquad (6)$$

Here, $v(t) \in T^2 \times \mathbb{R}^2$, and $F$ is a mapping from
$T^2 \times \mathbb{R}^2 \times [0, \lambda_0]$ into $\mathbb{R}^4$. (With $T^2$ we denote the
standard 2-torus.) The specific example that we have treated is a
system of two nonlinear coupled oscillators, where

$$v = [\theta_0, \theta_1, r_0, r_1]^T$$

and

$$F = \begin{bmatrix} \omega \\ \omega \\ r_0(1 - r_0^2) \\ r_1(1 - r_1^2) \end{bmatrix}
+ \lambda \begin{bmatrix}
-\cos 2\theta_0 + \frac{r_1}{r_0}\left(\cos(\theta_0 + \theta_1) - \sin(\theta_0 - \theta_1)\right) \\
-\cos 2\theta_1 + \frac{r_0}{r_1}\left(\cos(\theta_0 + \theta_1) - \sin(\theta_1 - \theta_0)\right) \\
r_1\left(\sin(\theta_0 + \theta_1) + \cos(\theta_0 - \theta_1)\right) - r_0(1 + \sin 2\theta_0) \\
r_0\left(\sin(\theta_0 + \theta_1) + \cos(\theta_0 - \theta_1)\right) - r_1(1 + \sin 2\theta_1)
\end{bmatrix}$$

(The value of $\omega$ is -0.55 in our calculations.) See Aronson,
Doedel, and Othmer [1] for a motivation of this system
and for the study of many interesting bifurcation
phenomena. Also, see Dieci, Lorenz, and Russell [3] for a
sequential calculation of some invariant manifolds.

In the uncoupled case, $\lambda = 0$, the system has the
attracting invariant 2-torus

$$\mathcal{M}(\lambda = 0) = \{(\theta, 1, 1) : \theta \in T^2\} \subset T^2 \times \mathbb{R}^2.$$

It follows from general theory (see Fenichel [4] and
Sacker [9]) that the torus persists for a sufficiently small
coupling constant $\lambda$ and that it can be parameterized in
the form:

$$\mathcal{M}(\lambda) = \{(\theta, R(\theta, \lambda)) : \theta \in T^2\},$$

where $\theta \to R(\theta, \lambda)$ is a function from $T^2 \to \mathbb{R}^2$. This
vector function $R(\cdot, \lambda)$ is the solution of a first order
system of partial differential equations, which depends on $\lambda$.
These partial differential equations are discretized, and a
symmetry is utilized to obtain a finite dimensional system
of the form (3).

From general theory one expects that the tori $\mathcal{M}(\lambda)$
lose more and more derivatives as $\lambda$ increases. The torus
"breaks" in a certain $\lambda$-region and disappears. The
calculations in [3] show breaking at about $\lambda = 0.2527$. In [12],
we compute a solution branch of the discretized system
on a 25 x 25 grid, and we obtain several fold points of this
discrete system between $\lambda = 0.2430$ and $\lambda = 0.2448$.

3  Implementation 

The  concurrent  efficiency  of  the  bordering  algorithm  is 
determined  almost  exclusively  by  the  efficiency  of  the  LU- 
decomposition.  The  latter,  in  turn,  depends  crucially  on 
an  interplay  between  the  pivot  locations  and  the  distribu¬ 
tion  of  the  matrix  entries  over  the  concurrent  processes. 
In  particular,  if  the  pivots  are  known  in  advance,  the  data 
distribution  can  be  chosen  accordingly,  and  near  ideal  load 
balance  can  be  achieved.  In  this  case,  the  algorithm  is 
also  easily  vectorized  because  all  active  data  remain  in 
contiguous  blocks. 

Hence,  efficiency  can  be  obtained  with  preset  pivots, 
but  numerical  stability  will,  in  general,  require  a  different 
pivoting  strategy.  In  our  approach  these  two  requirements 
are  hardly  in  conflict,  because  many  highly  correlated  ma¬ 
trices  are  factored  in  the  course  of  the  continuation  proce¬ 
dure.  The  reasonable  belief  that  the  pivot  locations  can  be 
kept  constant  along  a  whole  piece  of  the  branch  is  indeed 
confirmed  by  our  experience.  Therefore,  both  numerical 
stability  and  load  balance  can  be  achieved  by  using  a  dy¬ 
namic  pivoting  strategy  occasionally  (when  the  growth 
factor  has  exceeded  some  limit),  followed  by  an  adapta¬ 
tion  of  the  data  distribution  to  the  new  pivot  locations. 
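
The strategy of this paragraph can be sketched in a few lines. The code below is a serial illustration under our own assumptions, not the concurrent algorithm of [11]: the function names, the growth limit of 10, and the dense list-of-lists storage are all hypothetical, and the adaptation of the data distribution is left out. Pivoting is implicit (no entry is moved), preset pivots are reused from one matrix to the next, and complete pivoting is run only when the growth factor exceeds the limit.

```python
def eliminate(A, choose_pivot):
    """Gaussian elimination with implicit pivoting: entries never move.

    choose_pivot(W, rows, cols, k) returns the position (i, j) of pivot k
    among the still-active rows and columns.  Returns (pivots, growth),
    where growth is the largest entry seen divided by the largest entry
    of A; growth is infinite if a pivot is exactly zero.
    """
    W = [row[:] for row in A]
    n = len(W)
    rows, cols = set(range(n)), set(range(n))
    start = max(abs(v) for row in W for v in row)
    if start == 0.0:
        return [], float("inf")
    largest, pivots = start, []
    for k in range(n):
        i, j = choose_pivot(W, rows, cols, k)
        if W[i][j] == 0.0:
            return pivots, float("inf")
        rows.remove(i)
        cols.remove(j)
        pivots.append((i, j))
        for r in rows:
            m = W[r][j] / W[i][j]
            for c in cols:
                W[r][c] -= m * W[i][c]
                largest = max(largest, abs(W[r][c]))
    return pivots, largest / start


def complete_pivot(W, rows, cols, k):
    """Dynamic strategy: the largest active entry (first one on ties)."""
    candidates = sorted((i, j) for i in rows for j in cols)
    return max(candidates, key=lambda ij: abs(W[ij[0]][ij[1]]))


def factor_branch(matrices, limit=10.0):
    """Factor a sequence of related matrices, reusing pivot locations.

    Complete pivoting is run for the first matrix and whenever the
    growth factor obtained with the preset pivots exceeds `limit`.
    Returns (last pivot list, number of complete-pivoting runs).
    """
    pivots, dynamic_runs = None, 0
    for A in matrices:
        growth = float("inf")
        if pivots is not None:
            preset = pivots
            _, growth = eliminate(A, lambda W, r, c, k: preset[k])
        if growth > limit:
            pivots, growth = eliminate(A, complete_pivot)
            dynamic_runs += 1
            if len(pivots) < len(A):      # singular: nothing to reuse
                pivots = None
    return pivots, dynamic_runs


branch = [[[0.001, 1.0], [1.0, 1.0]],
          [[0.002, 1.0], [1.0, 1.0]],    # same pivots still work
          [[1.0, 0.0], [1.0, 1.0]]]      # preset pivot becomes zero
print(factor_branch(branch))
```

In this toy branch the second matrix is factored with the pivots of the first, and only the third one triggers a new pivot search.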

This  data  distribution  strategy  differs  from  most  others 
in  two  essential  aspects.  First,  it  takes  into  consideration 
the  global  behavior  of  the  program,  i.e.,  the  fact  that  the 
matrices  result  from  a  continuation  procedure.  Second, 
adapting  the  data  distribution  to  the  computation  itself 
is  an  integral  part  of  the  strategy.  In  section  4,  we  shall 
see  that  the  combination  of  these  two  ingredients  leads 
to  high  efficiency.  Here,  we  consider  the  implementation 
aspects  of  this  strategy. 

Because  the  data  distribution  is  adaptive  and  depends 
on  the  global  nature  of  the  continuation  program,  com¬ 
ponent  routines  like  the  LU-decomposition  and  the  back- 
solve  should  be  written  so  that  they  are  correct  indepen¬ 
dently  of  the  data  distribution.  For  such  routines,  the 
data distribution is part of the input data supplied in the
argument list when calling the routine. We use the
LU-decomposition described in [11] and its companion
back-solve algorithm. To achieve independence of the data
distribution, the LU-decomposition must do all pivoting
implicitly (otherwise the data distribution would depend on
the  pivots!).  In  fact,  all  routines  called  by  our  program 
must  have  the  property  that  they  are  correct  indepen¬ 
dently  of  the  data  distribution.  If  we  consider  these  rou¬ 
tines  as  the  components  of  a  library,  the  necessity  for  this 
property  follows  from  the  observation  that  the  writer  of 
the  library  routines  cannot  know  the  global  properties  of 
the  program  in  which  this  routine  will  be  used.  Hence,  the 
data  distribution  cannot  be  fixed  at  the  time  of  writing 
the  library.  In  fact,  our  LU-decomposition,  matrix-vector 
operations,  and  other  related  linear  algebra  routines  are 
packaged  in  a  data-distribution-independent  library.  Our 
continuation  program  uses  this  library  and  imposes  a  data 
distribution  on  it  at  run-time. 

To  provide  maximum  flexibility,  our  LU-decomposition 
allows  pivoting  of  both  rows  and  columns.  Besides  allow¬ 
ing  classical  pivoting  strategies  (row,  column,  diagonal, 
and  complete),  this  flexibility  also  leads  to  two  intrinsi¬ 
cally  concurrent  pivoting  techniques  with  increased  nu¬ 
merical  stability  and  load  balance.  For  details  on  those 
techniques,  we  refer  to  [11].  For  the  discussion  of  our  con¬ 
tinuation  program  we  introduce  just  one  dynamic  strat¬ 
egy,  complete  pivoting,  and  one  static  strategy,  preset  piv¬ 
oting.  Complete  pivoting  is,  in  general,  overkill  since  nu¬ 
merical  stability  can  be  obtained  with  less  expensive  piv¬ 
oting  strategies.  However,  complete  pivoting  ensures  that 
the  pivot  locations  are  highly  unpredictable  and,  hence, 
illustrates  best  the  adaptivity  of  our  program.  Moreover, 
complete  pivoting  is  used  only  occasionally,  i.e.,  when 
the  growth  factor  exceeds  a  set  tolerance.  For  most  LU- 
decompositions,  we  use  preset  pivots,  determined  by  the 
last  LU-decomposition  with  dynamic  pivoting.  Hence, 
the  cost  of  dynamic  pivoting  is  amortized  over  many  LU- 
decompositions. 

4  Performance 

The calculations were performed on a Symult Series 2010
multicomputer with up to 64 nodes. We investigate the
dependence of the execution time on the data distribution
for one LU-decomposition. Here, we used 64 nodes
and an 8 x 8 process grid. As expected, the adapted data
distribution turned out to be superior. We consider also,
for each fixed strategy, the dependence of the execution
time on the number of nodes. We used 2, 4, 8, 16, 32, and
64 nodes, and obtained excellent speedup for each
strategy. For absolute performance, we made a comparison
of the sequential version of our code with a fully
optimized C-version of the LINPACK benchmark [10]. Due
to memory restrictions, this comparison was done with a
random 300 x 300 matrix. A sequential version of our fast
preset pivoting algorithm ran about 5% slower than
LINPACK. (This 5% results from the fact that we have not
implemented a number of low level optimizations used by
LINPACK.)

Pivoting    Distrib.   Time(s)   Spdp.   Eff.(%)
Complete    Linear       75.3     41.4     64.7
Complete    Random       63.7     49.9     78.0
Complete    Scatter      62.8     46.2     72.2
Complete    Adapted      51.3     54.2     84.7
Preset      Linear       48.9     36.9     57.7
Preset      Scatter      40.3     42.6     66.6
Preset      Adapted      33.8     48.9     76.4
F. Preset   Adapted      29.7     50.0     78.2

Table 1: LU-decomposition times for a 25 x 25 grid
problem on a 64-node Symult Series 2010. Number of
megaFLOPS is based on $M^3/3$ floating point operations,
where $M = 25^2$ is the number of unknowns.

We consider the example of section 2 with $h = 2\pi/25$,
i.e., the number of unknowns at every step is M = 625.
In  Table  1,  we  present  timings  for  one  (typical)  LU- 
decomposition  using  complete  pivoting  and  preset  pivot¬ 
ing  in  combination  with  different  data  distributions  for 
the  factored  matrix.  The  linear  and  scatter  distributions 
are  static  distributions.  The  linear  distribution  allocates 
blocks  of  contiguous  rows  and  columns  to  processes.  The 
scatter  distribution  uses  a  wrap  mapping.  The  adapted 
distribution  uses  the  pivot  locations  of  the  previous  LU- 
decomposition  to  distribute  the  current  matrix  such  that 
ideal  load  balance  is  achieved,  if  the  pivot  locations  of  the 
current  matrix  coincide  with  those  of  the  previous  matrix. 
In  the  version  “Fast  Preset”  of  preset  pivoting,  certain  ad¬ 
ministrative  overhead  is  eliminated  using  the  information 
that  the  pivots  are  preset  and  that  a  particular  distri¬ 
bution  is  used.  All  calculations  were  done  on  a  64  node 
machine  using  64  processes,  one  process  running  on  each 
node.  The  process  grid  was  partitioned  into  P  =  8  process 
rows  and  Q  =  8  process  columns. 
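
To make the difference between the two static distributions concrete, the sketch below maps the M = 625 row indices onto the P = 8 process rows and counts how many rows each process still has to update after a number of elimination steps. This is only an illustration under simplifying assumptions that are ours: a single dimension of the process grid is considered, the pivots are taken in natural order, and the function names are hypothetical.

```python
def linear_owner(i, n, p):
    """Linear distribution: contiguous blocks of about n/p indices per process."""
    block = -(-n // p)                 # ceil(n / p)
    return i // block


def scatter_owner(i, n, p):
    """Scatter distribution (wrap mapping): index i goes to process i mod p."""
    return i % p


def active_counts(owner, n, p, k):
    """Rows still active after k elimination steps, per process,
    assuming the pivot rows are 0, 1, ..., k-1."""
    counts = [0] * p
    for i in range(k, n):
        counts[owner(i, n, p)] += 1
    return counts


print(active_counts(linear_owner, 625, 8, 300))   # some processes are already idle
print(active_counts(scatter_owner, 625, 8, 300))  # counts differ by at most one
```

With the linear mapping, the processes that own the first blocks run out of work as the elimination proceeds. The wrap mapping keeps the load even only as long as the pivots follow the natural order, which is the situation the adapted distribution restores after dynamic pivoting.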

To test the concurrent performance of our code, we
determine the execution time as a function of the number of
nodes. The same example as in Study 1 is computed
successively using 2, 4, 8, 16, 32, and 64 nodes, and always
choosing the number of processes equal to the number
of nodes, one process running on each node. The
numbers P and Q of process rows and columns were chosen
equal within divisibility constraints. When the logarithm
of the execution time is plotted as a function of the
logarithm of the number of processes, ideal speedup is
characterized by a straight line with slope -1 if appropriate
scales are used. Figure 1 shows that, for each strategy,
the execution-time plot is almost parallel to the line
characterizing ideal speedup. Table 1 can be used to identify
the individual timing plots.

Figure 1: LU-decomposition times for a 25 x 25 grid
problem as a function of number of nodes on a Symult Series
2010. (Horizontal axis: log2(Number of Processors).)

The  problem  was  too  big  to  run  on  a  one-node  machine. 
Precise  speedups  could  thus  not  be  calculated.  In  Table  1, 
we  give  speedups  and  efficiencies  with  respect  to  two-node 
timings,  i.e.,  the  real  speedup  is  estimated  by: 

$$S_{PQ} \approx 2\, T_2 / T_{PQ},$$

and the real efficiency is estimated by:

$$\eta_{PQ} \approx 2\, T_2 / (PQ\, T_{PQ}).$$

Here, $T_2$ is the two-node timing and $T_{PQ}$ is the timing
with P x Q nodes. Speedup and efficiency are good
measures for the overhead due to communication and load
imbalance.
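
The two estimates are simple enough to evaluate by hand, but a short sketch makes the convention explicit. The function names and the timings in the example are hypothetical and are not taken from our measurements; only the formulas come from the text.

```python
def estimated_speedup(t2, t_pq):
    """Speedup estimate 2 * T2 / T_PQ, with T2 the two-node timing."""
    return 2.0 * t2 / t_pq


def estimated_efficiency(t2, t_pq, p, q):
    """Efficiency estimate: the estimated speedup divided by the P x Q nodes."""
    return estimated_speedup(t2, t_pq) / (p * q)


# Hypothetical timings in seconds, chosen only for illustration.
t2, t64 = 1000.0, 40.0
print(estimated_speedup(t2, t64))            # 50.0
print(estimated_efficiency(t2, t64, 8, 8))   # 0.78125
```

The same relation links the last two columns of Table 1: on the 8 x 8 grid the efficiency is the speedup divided by 64, e.g. 41.4 / 64 is about 64.7%.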

When  varying  the  data  distribution  and  keeping  the 
pivoting  strategy  fixed,  it  is  clear  that  the  adapted  data 
distribution  is  the  most  efficient.  This  is  easily  explained 
by  the  increased  load  balance  of  the  adapted  data  distri¬ 
bution.  This  observation  holds  for  both  complete  pivoting 
and  preset  pivoting. 

When  comparing  efficiencies  for  the  same  distribution 
but  for  different  pivoting  strategies  (i.e.,  in  Table  1  com¬ 
pare  lines  1  and  5,  3  and  6,  4  and  7),  it  is  seen  that 
complete  pivoting  is  more  efficient.  This  is  because  the 
pivot-search cost leads to a higher ratio of computation to
communication  time  for  complete  pivoting  than  for  preset 
pivoting. 

Another  interesting  observation,  which  follows  from  Ta¬ 
ble  1,  is  that  complete  pivoting  with  the  random  distri¬ 
bution  (line  2)  is  more  efficient  than  complete  pivoting 
with  the  scatter  distribution  (line  3).  The  execution  time, 
however,  is  lower  for  the  scatter  distribution.  The  ran¬ 
dom  distribution  is  better  than  the  scatter  distribution 
for  load  balancing,  and  hence,  has  higher  efficiency.  The 
random  distribution  leads  to  very  irregular  memory  access 
patterns,  however,  and  that  causes  the  absolute  execution 
time  to  be  larger. 
Abstract

The quadratic sieve algorithm is a powerful method
for factoring large integers up to 100 digits. In this
paper, we study in detail each step of the algorithm,
in order to derive efficient parallel implementations
on distributed memory multiprocessors. Our aim is
to show, taking the quadratic sieve algorithm as a
test case, that very efficient programming
methodologies can be derived in order to get the
most out of the target architecture. We describe an
implementation of the quadratic sieve algorithm on
the FPS T40 hypercube. We evaluate the solution
through the results we obtain, particularly the
superlinear speedup, and we try to explain these
speedups. From these experimental considerations,
we derive more efficient implementations, including
totally equidistributed tasks among the processors.
Some other refinements may be added to the
algorithm.


I - Introduction

The quadratic sieve algorithm [Pom 85], [Sil 87] is a
powerful method for factoring large integers up to
100 digits. Various authors have already published
interesting results on parallel implementations of this
algorithm [PST 88], [CaS 88]. From an algorithmic
point of view, these parallel implementations are
trivial; the only part of the algorithm to be parallelized
is the sieve itself. Indeed, it is the most consuming
step, both in time and space, so that most of the
parallel versions of the quadratic sieve algorithm
have been implemented on networks of independent
machines [CaS 88], [Sil 88], [LeM 89]. But some
other versions have been implemented on very
powerful vector computers, taking advantage of the
large memory space and the high power of the
processors of the Cray XMP for example [DHS 85],
[RLW 88]. Only one attempt is known to us to
implement this algorithm on a distributed memory
multiprocessor, by Davis and Holdridge [DaH 88],
on the Ncube.

1 This work has been partially supported by the
Coordinated Research Program C3 of the French
CNRS and the DRET.

We propose to model all the steps of the algorithm
for an implementation on distributed memory
multiprocessors: the initialization phase, the
generation of the polynomials, the sieve, the choice
of the w(x) candidates for factorization, the
factorization of these w(x), and the Gaussian
elimination on the w(x) matrix.

In  this  paper,  we  study  in  detail  each  step  of  the 
algorithm,  in  order  to  derive  efficient  parallel 
implementations  on  distributed  memory 
multiprocessors.  The  target  architecture  is  the  FPS 
T40  hypercube,  crudely  parameterized  in  terms  of 
communication facilities. The FPS T40 is a 32-processor
(T414) hypercube with Weitek floating point
coprocessors; communication is only possible
between neighbors of a 5-dimensional hypercube
topology; communication implies local
synchronization since it is done through a
"rendez-vous" protocol.

Our aim is to show, taking the quadratic sieve
algorithm as a test case, that very efficient
programming methodologies can be derived in
order to get the most out of the architecture. In the
case  of  the  FPS  T40,  load  balancing  the  computation 
phases,  overlapping  computation  and  communication 
and  restricting  the  message  exchanges  to  neighbors 
are  the  crucial  features.  Interconnection  topologies  of 
communicating  processes  are  reduced  to 
subtopologies  of  the  hypercube. 


II  -  The  quadratic  sieve  algorithm 

Suppose that an integer N is not a prime number (this
can be verified with a Fermat test), and that N has
no small prime factors (they have been found with a
sieve of Eratosthenes).

Here  is  the  algorithm  to  factor  N,  with  the  quadratic 
sieve  method: 


Begin algorithm
1 - Initialization step:
    Compute and store:
        k (number of elements in the base),
        B (the smallest power of 2 greater than each
           element of the base),
        M (the upper bound of the sieve interval).
    Compute and store the k elements $p_i$ of the base,
        with $p_i$ prime and $\left(\frac{N}{p_i}\right) = 1$.
    Compute and store the solutions of the equations
        $x^2 = N \bmod p_i^{\alpha}$, with $p_i^{\alpha} < B$.
End initialization phase

While (not enough factored w(x)) do
    2 - Compute a polynomial $w(x) = a^2 x^2 + 2bx + c$:
        Compute a: a must be prime, $a = 3 \bmod 4$,
            a near $\sqrt{\sqrt{2N}/M}$, and $\left(\frac{N}{a}\right) = 1$.
        Deduce b: $b^2 = N \bmod a^2$, and $|b| < a^2/2$.
        Deduce c: $c = (b^2 - N)/a^2$.

    3 - Sieve with each element $p_i$ of the base and
        each power $\alpha$ such that $p_i^{\alpha} < B$:
        Initialize the array tab_sieve[-M, M[ to 0.
        For each $p_i^{\alpha}$ do
            Find the starting index s near -M.
            Repeat
                Add $[\log_2 p_i]$ to tab_sieve[s].
                $s = s + p_i^{\alpha}$.
            until s >= M
        endfor

    4 - Define a bound V.
        For each x in [-M, M[ with tab_sieve[x] > V do
            Compute w(x).
            5 - Try to factor w(x) over the base. Each
                factored w(x) is stored as a line of the
                matrix M.
        Endfor
End while

6 - Perform a Gaussian elimination on the
    matrix M of the w(x).
    Compute the GCD of some dependent lines of
    the matrix M. The GCD is a cofactor of N.

End algorithm.

Figure  1.  The  MPQS  Algorithm. 
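
The sieving and selection steps (3 and 4) of Figure 1 can be tried on a toy number. The sketch below is a deliberately simplified illustration and not the implementation studied in this paper: it uses the single polynomial w(x) = (x + r)^2 - N of the basic quadratic sieve instead of the MPQS polynomials, it sieves with the primes only (no higher powers), it finds the roots by brute force, and it replaces the fixed bound V by a threshold relative to log2 |w(x)|. The names `sieve_candidates`, `is_smooth`, and `slack` are ours.

```python
from math import isqrt, log2


def sieve_candidates(N, base, M, slack=1.0):
    """Sieve w(x) = (x + r)^2 - N for x in [-M, M), with r = isqrt(N) + 1.

    Returns the x whose accumulated logarithms come within `slack` bits
    of log2 |w(x)|, i.e. the candidates for a complete factorization.
    """
    r = isqrt(N) + 1
    tab = [0.0] * (2 * M)             # tab[x + M] stands for tab_sieve[x]
    for p in base:
        for t in range(p):
            if (t * t - N) % p:       # t is not a root of t^2 = N mod p
                continue
            s = -M + (t - r + M) % p  # first x >= -M with x + r = t mod p
            while s < M:
                tab[s + M] += log2(p)
                s += p
    found = []
    for x in range(-M, M):
        w = (x + r) ** 2 - N
        if w != 0 and tab[x + M] >= log2(abs(w)) - slack:
            found.append(x)
    return found


def is_smooth(w, base):
    """True if the nonzero integer w factors completely over the base (up to sign)."""
    w = abs(w)
    for p in base:
        while w % p == 0:
            w //= p
    return w == 1


N, base = 15347, [2, 17, 23, 29]
r = isqrt(N) + 1
for x in sieve_candidates(N, base, M=100):
    print(x, (x + r) ** 2 - N, is_smooth((x + r) ** 2 - N, base))
```

Only the selected x are factored by trial division, which is the point of the sieve: the expensive factorization of step 5 is attempted on a small fraction of the interval.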


1  -  The  initialization  step 
The initialization step mainly computes the k
elements of the base of prime numbers to be used in
the remaining steps. This computation can be achieved
with full parallel efficiency when using the P
processors, indexed by i (i = 0, ..., P-1), as a farm of
processors. The k elements are statistically lower
than 2k log(2k). Each processor i is assigned an
interval

$$\left[\, i\,\frac{2k\log(2k)}{P},\ (i+1)\,\frac{2k\log(2k)}{P} \right[ .$$

In this interval, there are statistically k/P elements. The
processors search for all the elements in their own
interval.
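
The farm search can be sketched as follows. The sketch rests on assumptions of ours: the logarithm is taken to be the natural logarithm, the processors are simulated by a loop, primality is tested by trial division, the prime 2 is left out, and membership in the base is decided with Euler's criterion. The function names are hypothetical.

```python
from math import log


def is_prime(n):
    """Trial-division primality test, sufficient for small bounds."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def base_elements_in(N, lo, hi):
    """Odd primes p in [lo, hi) for which N is a quadratic residue
    (Euler's criterion: N^((p-1)/2) = 1 mod p)."""
    return [p for p in range(max(lo, 3), hi)
            if is_prime(p) and pow(N, (p - 1) // 2, p) == 1]


def farm_search(N, k, P):
    """Give each of the P simulated processors one slice of [0, 2k log(2k))."""
    limit = int(2 * k * log(2 * k)) + 1
    bounds = [i * limit // P for i in range(P + 1)]
    return [base_elements_in(N, bounds[i], bounds[i + 1]) for i in range(P)]


parts = farm_search(15347, k=10, P=4)
found = sum(len(part) for part in parts)
print(parts)
print("missing elements:", 10 - found)
```

The slices generally contain different numbers of elements, and the total can fall short of k, which is why a gathering and rebalancing phase follows the search.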


Once they have completed this step, we gather and
sum on processor 0 (in log(P) communication steps)
the number of elements that each processor has
found, which gives the effective total m. Then,
processor 1 searches for the (k-m) missing elements
to get a distributed base of k elements, while the
other processors equally distribute the k elements of
the base on the P-1 remaining processors.

Balancing the elements between the (P-1) remaining
processors can be achieved in log P communication
steps using an all-to-one-type gathering algorithm on a
spanning tree. Let us describe in more detail the
algorithm for a hypercube topology. The following
figure shows the second step of this strategy:


Figure 2. Second step on a 4-dimensional hypercube


Consider  a  spanning  tree  of  the  hypercube  rooted  in 
processor  0,  such  that  processor  1  is  the  subtree 
reduced  to  one  element.  Isolate  processor  1  during  its 
computation  phase  (computation  of  the  (k-m) 
missing  elements  in  the  base).  Each  leaf  of  the  tree 
exchanges  some  elements  with  its  father  in  order  to 
get  the  exact  number.  Recursively  suppress  the 
leaves  and  repeat  the  process  on  a  spanning  tree  of  a 
hypercube  of  reduced  dimension. 

After log P communication steps, all the processors,
except  processors  0  and  1,  have  the  exact  number  of 
elements.  As  soon  as  processor  1  has  computed  the 
(k-m)  missing  elements,  0  and  1  load  balance  their 
elements.  This  algorithm  works  for  any  initial 
distribution  of  the  number  of  elements  we  have 
encountered,  thanks  to  the  distribution  of  primes  in 
intervals. 

Let us present the TNode and the iPSC distributed
memory multiprocessors. The TNode computer has
32 to 128 processors (T800 Transputer) with on-chip
floating point arithmetic; a dedicated network allows
dynamic reconfiguration of the interconnection
topology (network of maximum degree 4);
reconfiguration implies global synchronization
points; communication implies local synchronization
since it is done through a "rendez-vous" protocol.


The  iPSC  has  32  to  64  processors  (80286)  with 
floating  point  coprocessors;  routing  VLSI  devices 
allow  for  point  to  point  communications  between  the 
processors  of  the  hypercube;  messages  are  stored  in 
a  FIFO  queue  for  each  processor,  which  implies  no 
synchronization  although  the  exchanges  must  be 
done  carefully. 

This technique is well suited for the FPS T40
computer since it is basically a hypercube. In the case
of the TNode, the network has to be configured in
a topology whose diameter is of order log P.
For example, we could use a perfect shuffle ring
network. In the case of the iPSC hypercube, this
technique works provided that care is taken with the
amount of data transmitted.

The next step consists in computing the roots of the
equations $x^2 = N \bmod p_i^{\alpha}$, for each element $p_i$ of the
base, and each integer $\alpha$ such that $p_i^{\alpha} < B$. Before
computing these roots, we have to distribute the $p_i$'s
such that the computation of the roots takes the same
time on all the processors. Typically, each $p_i$ implies
$\alpha$ equations to be computed. For large $p_i$'s, $\alpha = 1$.
But for small $p_i$'s, $\alpha > 1$. So, if each processor has
the same number k/P of elements, the processors with
the smallest values of the $p_i$'s have more work to
perform than the other processors.
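
The imbalance is easy to quantify on a small base. The sketch below counts the equations attached to each prime and compares an equal-count split with one possible remedy, a greedy assignment that keeps all the equations of a prime on one processor. The greedy rule and the function names are assumptions of ours, chosen for illustration; they are not the gathering algorithm on the spanning tree described earlier.

```python
def equations_for(p, B):
    """Number of exponents alpha >= 1 with p**alpha < B."""
    count, power = 0, p
    while power < B:
        count += 1
        power *= p
    return count


def equal_count_loads(primes, B, P):
    """Equations per processor when each processor gets the same number
    of consecutive primes."""
    size = -(-len(primes) // P)       # ceil(len / P)
    return [sum(equations_for(p, B) for p in primes[i * size:(i + 1) * size])
            for i in range(P)]


def balanced_loads(primes, B, P):
    """Greedy assignment: primes with the most equations first, each one
    placed whole on the currently least loaded processor."""
    loads = [0] * P
    for p in sorted(primes, key=lambda q: -equations_for(q, B)):
        target = loads.index(min(loads))
        loads[target] += equations_for(p, B)
    return loads


primes, B = [2, 3, 5, 7, 11, 13, 17, 19], 32
print(equal_count_loads(primes, B, 2))   # [10, 4]
print(balanced_loads(primes, B, 2))      # [7, 7]
```

In this example the processor holding the four smallest primes has 10 of the 14 equations, while the greedy assignment gives 7 to each processor.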

This is the reason why we slightly modify the step
for the distribution of the k elements of the factor
base. One still computes the total number of
equations to be solved. And instead of balancing the
k elements over the P processors, one distributes the
elements such that the number of equations is the
same on each processor. But to solve
$x^2 = N \bmod p_i^{\alpha}$, with $\alpha \ge 2$, one needs the solution of
$x^2 = N \bmod p_i^{\alpha-1}$. So all the equations for a given
$p_i$ must be solved on the same processor; otherwise
communications occur and waste time, and the
computation is no longer efficient. Hence, under the
constraint that the equations for a given $p_i$ are solved
on a single processor, the tasks may not be strictly
balanced. But each processor can perform this
computation independently and with maximum
efficiency. The roots are stored on each processor.
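
The dependency between successive powers can be seen in a short sketch of the lifting step. It assumes an odd prime p that does not divide N and uses one Newton (Hensel) step per power. The function names are ours, the first root is found by brute force, and the modular inverse relies on the three-argument `pow` with a negative exponent (Python 3.8 or later).

```python
def lift_root(N, p, r, modulus):
    """Given r with r*r = N mod modulus, where modulus = p**(alpha-1),
    p is an odd prime and p does not divide N, return a root modulo
    modulus * p."""
    inv = pow(2 * r, -1, p)                   # inverse of 2r modulo p
    t = ((N - r * r) // modulus * inv) % p    # correction digit
    return (r + t * modulus) % (modulus * p)


def roots_up_to(N, p, B):
    """One root of x^2 = N modulo p, p^2, ... for every power below B.

    Each root is obtained from the previous one, so the whole chain has
    to be computed in sequence on one processor.  The other root is the
    negative of the one returned.
    """
    r = next(t for t in range(1, p) if (t * t - N) % p == 0)
    roots, modulus = {}, p
    while modulus < B:
        roots[modulus] = r
        r = lift_root(N, p, r, modulus)
        modulus *= p
    return roots


print(roots_up_to(15347, 17, 512))   # roots modulo 17 and 289
```

Because every call to `lift_root` consumes the root of the previous power, splitting the powers of one prime over several processors would force a communication at each step.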

The computation within the initialization phase is now
complete. Depending on the requirements of the sieving
phase of the algorithm, one has to broadcast the parts
of the base from each processor to all the other ones.
As the amount of data to be broadcast is large, one
can use the broadcast algorithms in a hypercube
given by Ho and Johnsson, to achieve low
communication costs [HoJ 89].

2  -  The  generation  of  the  polynomials 

The polynomials in the quadratic sieve are used to
generate integers that factor on the base. There are
many ways of generating the polynomials on a
distributed multiprocessor, depending on the way
both the elements of the base and the sieve interval
are distributed among the processors.

-  The  polynomials  can  be  generated  before  the  loop. 
In  this  case,  the  generation  can  be  done  with  full 
parallelism,  such  that  no  communication  occurs, 
and  no  processor  is  idle  while  other  ones  are 
working.  The  polynomials  are  present  in  the 
network  but  they  are  distributed  among  the 
processors.  They  take  a  large  space  (680  Mbytes 
for  a  100-digit  N).  This  is  unrealistic  for  our 
examples  of  distributed  memory  multiprocessors. 

-  Or  they  can  be  generated  in  the  sieving  loop. 

•  The polynomials can be generated by each
processor. But redundant work is done if the
processors have parts of the base and/or parts of
the sieve interval, because all of them have to
generate the same polynomials.

•  The polynomials can be generated by a dedicated
master processor. This master sends the
polynomials to the slaves that sieve. We have to
decide whether the slaves have the whole base
and the whole sieve interval, or if one of these
entities or both are distributed among the slaves.
If they have the whole base and the whole
interval, each slave must sieve with its own
polynomials. So the master must send different
polynomials to each slave (many sequential
communications). In the other cases, the master
must send the same polynomial to each slave (a
single one_to_all communication).

It is not obvious which solution needs the
minimum execution time (see figure 3 below). In the
farm of independent processors (1), a compensation
factor may appear. Indeed, the time needed for
computing a new polynomial is the same on all the
processors, but the time needed for sieving may be
slightly different, so a processor that is slow on one
polynomial can catch up on the next. This cannot occur
in the Master/Slaves solution (2): of course there is no
redundant work, but communications and
synchronizations never let compensation play its
role. Indeed, with the Master/Slaves strategy, the
time needed to do all the work with one polynomial
is the time needed by the slowest processor to do its
computation. Furthermore, only P-1 processors sieve
if a processor is dedicated to the computation of the
polynomials, but P processors sieve with the first
strategy, because none of them is dedicated to
another task.

Figure 3. Different timings for the execution with or without a Master
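
The compensation effect can be illustrated with a toy timing model. The model is an assumption of ours and ignores many details (in particular, it charges the same sieving times to both strategies and lets every processor sieve): in the farm, each processor accumulates its own times and the run ends with the slowest total, whereas with a master every round ends with the slowest slave. All names and numbers are hypothetical.

```python
def farm_time(gen, sieve):
    """Farm: no synchronization between polynomials.

    sieve[i][j] is the sieving time of processor i for its j-th
    polynomial and gen the time to generate one polynomial.  The run
    ends when the slowest processor finishes: a maximum of sums.
    """
    return max(sum(gen + s for s in row) for row in sieve)


def master_slave_time(comm, sieve):
    """Master/Slaves: the slaves wait for each other after every
    polynomial, so each round costs the slowest slave plus the
    communication time `comm`: a sum of maxima."""
    return sum(comm + max(column) for column in zip(*sieve))


# Three processors, three polynomials; each processor is slow once.
sieve = [[10, 12, 10],
         [12, 10, 10],
         [10, 10, 12]]
print(farm_time(1, sieve))           # 35
print(master_slave_time(1, sieve))   # 39
```

A maximum of sums never exceeds the corresponding sum of maxima, which is the compensation mentioned above; the redundant polynomial generation of the farm and the dedicated master processor shift the balance in one direction or the other.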

3  -  The  sieve 

This step is highly time consuming. Its implementation
depends on the distribution of the elements of the
base, the sieve interval, and the polynomials. For a
100-digit integer, the base takes 1 Mbyte and the
sieve interval 13 Mbytes. These values are theoretical
and depend on the target machine. The polynomials
are generated one at a time.

The implementation of [CaS 88] is such that each
processor is a workstation and can handle the whole
base and the interval. The idea is to give each
workstation different polynomials. On distributed
memory multiprocessors, we cannot really work this
way, because of memory requirements. But we can
communicate between processors and/or store and
recall data on disks. For example, the FPS T40
distributed memory hypercube multiprocessor has
32 1-Mbyte nodes. It is impossible to place both the
base and the interval on each processor. We have to
distribute them and/or store them on external disks.
When a processor needs data, it can either initiate a
communication with a neighbor, or read from a disk
if the data is too far in terms of communications. The
choice depends also on the communication speed
between processors and the speed between a
processor and a disk.

On the FPS T40, the system and the program use 250 Kbytes on each node. This leaves 750 Kbytes to be shared between a sub-interval and a subset of the elements of the base. The best theoretical compromise is to have a large sub-interval and a reduced subset of the elements of the base, since this minimizes the total communication time. However the 750 Kbytes are divided between the base and the interval, communicating between neighbors is always faster than reading data from disks.
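The memory figures quoted above can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 Mbyte = 1000 Kbytes), which is what makes 1 Mbyte minus 250 Kbytes equal the 750 Kbytes given in the text, and it uses the theoretical sizes for a 100-digit integer (1 Mbyte base, 13 Mbyte interval).

```python
KB_PER_MB = 1000                 # assumption: decimal units
NODES = 32
FREE_KB_PER_NODE = 1 * KB_PER_MB - 250   # 750 Kbytes left for data

base_kb = 1 * KB_PER_MB          # factor base for a 100-digit integer
interval_kb = 13 * KB_PER_MB     # sieve interval for a 100-digit integer

replicated_kb = base_kb + interval_kb        # everything on every node
distributed_kb = replicated_kb / NODES       # everything spread evenly

fits_replicated = replicated_kb <= FREE_KB_PER_NODE
fits_distributed = distributed_kb <= FREE_KB_PER_NODE
print(replicated_kb, distributed_kb, fits_replicated, fits_distributed)
```

Under these assumptions a full copy per node is far too large, while an even split of both structures would need 437.5 Kbytes per node and fit within the available 750 Kbytes.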

4  -  The  choice  of  the  w(x)  candidates  for  a 
complete  factorization 

This step is a linear search in the sieve interval to find those w(x) that will probably factor over the base. If the interval is completely distributed among the processors, or if each processor has the whole interval (sieved with different polynomials on different processors), this step can be carried out fully in parallel. It is completed for each new polynomial.

5  -  The  factorization  of  these  w(x) 

If each processor has the whole base, this step can be achieved in parallel, without any interaction between processors. If not, a pipeline algorithm can be implemented, such that each w(x) is divided by all the elements of the base (the elements are distributed among the processors). This step is done using a second sieve. A possible implementation is to use a logical ring and to synchronously rotate the information from one processor to the next. The smallest amount of data has to be communicated: a choice is necessary between the sub-interval and the subset of the elements of the base.
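The ring pipeline can be sketched as a sequential simulation. The code below is an illustration under stated assumptions, not the implementation described in the paper: it rotates the candidates (rather than the base subsets), uses plain trial division instead of a second sieve, and the function name and data layout are hypothetical.

```python
def ring_trial_division(base_parts, candidate_parts):
    """Rotate blocks of candidates around a logical ring of processors.

    base_parts[p]      -- primes of the factor base stored on processor p
    candidate_parts[p] -- positive candidates initially held by processor p
    Returns the candidates that factor completely over the whole base.
    """
    p = len(base_parts)
    blocks = [[{"w": w, "rest": w, "factors": []} for w in part]
              for part in candidate_parts]
    for _ in range(p):
        # Every processor divides the block it currently holds by its primes.
        for proc in range(p):
            for item in blocks[proc]:
                for prime in base_parts[proc]:
                    while item["rest"] > 1 and item["rest"] % prime == 0:
                        item["rest"] //= prime
                        item["factors"].append(prime)
        # Synchronous rotation: processor i passes its block to processor i+1.
        blocks = blocks[-1:] + blocks[:-1]
    return [item for block in blocks for item in block if item["rest"] == 1]

base_parts = [[2, 3], [5, 7], [11, 13], [17, 19]]
candidate_parts = [[110, 23], [63], [646], [58]]
smooth = ring_trial_division(base_parts, candidate_parts)
print([item["w"] for item in smooth])
```

After as many rotations as there are processors, every block has visited every subset of the base and is back on its home processor.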

6  -  Gaussian  elimination 

The solution of large dense linear systems of algebraic equations over finite fields is the computational kernel of many important algorithms of computer algebra. In particular, large integer factoring routines such as the quadratic sieve algorithm compute, in their final stage, the nullspace of the transpose of a matrix A, i.e. the vectors x that solve the equation $x^T A = 0$. The quadratic sieve algorithm requires the factorization of a matrix of more than 30,000 rows and columns for a 100-digit integer. However, Gaussian elimination requires a number of arithmetic operations proportional to the cube of the order of the matrix, and this seriously limits both the size of the problems that can be dealt with and the speed of the solution when implemented on sequential computers. An implementation is possible on a linear array of processors [CoR 87].
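A minimal sequential sketch of this elimination over Z/2Z is given below. It is not the linear-array algorithm of [CoR 87]; it assumes each matrix row is packed into a Python integer (bit j is the parity of the exponent of the j-th base element) and returns the row combinations x with $x^T A = 0$ modulo 2. The function name is illustrative.

```python
def left_nullspace_gf2(rows):
    """Find row subsets (as bitmasks over row indices) that XOR to zero."""
    pivots = {}          # pivot bit -> (reduced row, combination bitmask)
    dependencies = []
    for i, row in enumerate(rows):
        combo = 1 << i   # records which original rows were added together
        while row:
            pivot = row.bit_length() - 1
            if pivot not in pivots:
                pivots[pivot] = (row, combo)
                break
            pivot_row, pivot_combo = pivots[pivot]
            row ^= pivot_row          # addition modulo 2
            combo ^= pivot_combo
        else:
            # The row was reduced to zero: combo is a linear dependency.
            dependencies.append(combo)
    return dependencies

rows = [0b011, 0b110, 0b101, 0b011]
deps = left_nullspace_gf2(rows)
print([bin(d) for d in deps])
```

Packing rows into integers makes each row addition a single XOR, which is the usual way to keep the cubic operation count tolerable for 0/1 matrices.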


III - Implementation on the FPS T40 hypercube

The FPS T40 hypercube is viewed as a network of transputers, since we do not use the facilities provided by the Weitek coprocessors. The programs are written in C, which allows dynamic use of the memory. All the processors run almost the same program code but work on different data. The infinite-precision integer package is due to J.L. Roch [Roc 90]. It is very powerful and we use it during the initialization phase, during the sieve phase for generating new polynomials and factoring some w(x), and for computing gcd's at the end of the execution.

We do not implement any of the algorithmic or mathematical variations that can be found in the literature.

1  -  Implementation 

Concerning the initialization phase, we implement the theoretical study presented above. Each processor receives a subinterval of the total interval where the k elements of the base may statistically be found. Each processor searches for all the elements in its subinterval. Then a reduction phase counts the number of elements found. On a spanning tree embedded in the hypercube, every processor sends its element count to its father. The father sums the counts and sends the result to its own father, up to the root.
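The reduction phase can be sketched as follows. The paper does not say which spanning tree is embedded in the hypercube; this sketch assumes the usual binomial tree rooted at node 0, in which the father of a node is obtained by clearing its lowest set bit, and it simulates the message passing sequentially.

```python
def tree_reduce_sum(values):
    """Sum one count per node up a binomial spanning tree (root = node 0)."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "node count must be a power of two"
    partial = list(values)
    step = 1
    while step < n:
        # Nodes whose lowest set bit is `step` send their partial sum to the
        # father, which differs from them in exactly that bit (one link).
        for node in range(step, n, 2 * step):
            father = node - step
            partial[father] += partial[node]
        step *= 2
    return partial[0]

counts = list(range(32))      # hypothetical element counts on 32 nodes
print(tree_reduce_sum(counts))
```

On a 32-node cube the root holds the total after 5 communication rounds, one per hypercube dimension.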

The root designates a processor PE (with no sons in the tree) which searches for all the missing elements. Then each processor sends to or receives from its father some elements in order to get the right number of elements. This is necessary to distribute evenly the work of computing the square roots of N. When processor PE has computed all the missing elements, it exchanges some of them with the root, so that both of them get the right number of elements. The square roots of N can then be computed with maximum efficiency, because this takes the same time on all the processors.

The factor base (elements and square roots) is now completely computed, but it is distributed among the processors. Our implementation requires that each processor knows the whole base. This means that each processor has to send its part of the base to all the other processors, and has to receive from all the other processors their own part of the base. This can be done through an all_to_all communication procedure. Then the whole factor base is present on each processor. This implies a strong redundancy, but avoids future communication steps.
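One common way to realize such an all_to_all exchange on a hypercube is recursive doubling: in round i, every node swaps everything it knows with its neighbor across dimension i. The sketch below simulates this sequentially; it is an assumption about how the procedure could work, not a description of the FPS T40 library routine.

```python
def all_to_all_gather(parts):
    """Give every node a copy of every node's part in log2(n) exchange rounds."""
    n = len(parts)
    assert n > 0 and n & (n - 1) == 0, "node count must be a power of two"
    known = [{node: part} for node, part in enumerate(parts)]
    bit = 1
    while bit < n:
        # Snapshot first, so that all exchanges of a round happen at once.
        snapshot = [dict(k) for k in known]
        for node in range(n):
            known[node].update(snapshot[node ^ bit])   # neighbor across one link
        bit <<= 1
    return known

base_parts = [[2, 3], [5, 7], [11, 13], [17, 19]]   # hypothetical base subsets
result = all_to_all_gather(base_parts)
print(result[0])
```

The amount of data exchanged doubles at each round, but only log2(n) rounds are needed, and every message travels over a single hypercube link.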

The sieve phase is implemented as follows. We distribute the polynomials among the processors. Hence each processor has its own family of polynomials, the whole factor base (which is already the case), and the whole sieve interval. This interval is viewed as an array indexed on [-M, M[. During this phase, the processors can work independently as a farm of processors. Each of them has to find k/32 lines of the matrix, in order to get k lines among all the processors. In fact, about 0.96k lines may suffice. The processors do not need to communicate, since they all perform a complete quadratic sieve factorization, but with a restricted set of polynomials. Of course each one of them generates a small part of the whole matrix. All the parts are gathered at the end of the execution to build the global matrix M of the prime decompositions of the factored w(x).

This matrix is reduced modulo 2, i.e. embedded in Z/2Z, and a Gaussian elimination is performed. With probability 0.5, each linear dependency leads to a nontrivial factor of N.
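How a linear dependency turns into a factor can be seen on a toy example. The sketch below is illustrative only: it uses the small number N = 1649 and two hand-picked relations instead of sieved w(x) values. Multiplying the relations gives a congruence of squares $x^2 \equiv y^2 \pmod N$, and gcd(x - y, N) is then a factor of N, which may or may not be trivial.

```python
from math import gcd, isqrt

N = 1649
# Relations a^2 = r (mod N): 41^2 = 32 and 43^2 = 200 (mod 1649).
relations = [(a, a * a % N) for a in (41, 43)]

x = 1
product = 1
for a, r in relations:
    x = x * a % N
    product *= r          # 32 * 200 = 6400, a perfect square

y = isqrt(product)
assert y * y == product   # the dependency makes all exponents even

factor = gcd(x - y, N)
print(factor, N // factor)
```

Here x = 114 and y = 80, so gcd(34, 1649) = 17 and N = 17 * 97. When the gcd is 1 or N, another dependency has to be tried.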

2  -  Results 

We have factored small integers in the range 35-41 decimal digits. The execution times are given in Figure 4.

Figure 4. Execution time vs size of N (horizontal axis: number of digits, 34 to 42)

Figure 4 shows that the execution time does not grow linearly with the number of digits of N. In fact, the heuristic run time is [Pom 82]:

$\exp\left((1 + o(1))\sqrt{\ln N \, \ln \ln N}\right)$.
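To get a feeling for this growth, the formula can be evaluated for two sizes of N. The sketch drops the o(1) term, which is an assumption made purely for illustration, so the ratio is only a rough indication and not a prediction of the measured times.

```python
from math import exp, log, sqrt

def heuristic_cost(digits):
    """exp(sqrt(ln N * ln ln N)) for an N with the given number of digits."""
    ln_n = digits * log(10)
    return exp(sqrt(ln_n * log(ln_n)))

ratio = heuristic_cost(60) / heuristic_cost(41)
print(f"cost(60 digits) / cost(41 digits) is about {ratio:.0f}")
```

Adding 19 digits multiplies the estimated cost by roughly two hundred, far more than the ratio 60/41 that a linear law would give.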


Figure 5. Speedups for a 41-digit integer (axes: speedup vs number of processors)

In order to compute experimental speedups, we tried to factor a 41-digit integer. Figure 6 depicts the experimental speedup curve compared with the linear speedup. The obtained performance is surprising: with 32 processors, the speedup (computation time with one processor divided by computation time with 32 processors) is equal to 132.


Figure 6. Speedup when factoring a 41-digit integer

The superlinear speedups come from the growth of memory space when increasing the number of processors. As the dynamic memory space is mainly used to compute the polynomials, increasing the memory space avoids part of the management of this memory, i.e. fewer "allocate" and "free" procedure calls.

3 - Load balancing the computation

More accurate results show that the work is not equally distributed among the processors. Indeed, the slowest processor needs twice the time of the fastest processor to complete its task. Furthermore, this time is directly linked to the number of polynomials that the processor has to generate and work with to get its k/32 lines. Figures 7 and 8 show that the ratio of the slowest processor time to the fastest processor time is close to 2. The workload is 75%.
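The text does not define the workload figure. One plausible reading, assumed here, is the mean busy time of the processors divided by the time of the slowest one. With hypothetical times spread evenly between T and 2T this gives exactly 75%, and applied to the two times of the 51-digit row of Figure 12 it gives 98.6%, which agrees with the value listed there.

```python
def workload(times):
    """Mean busy time divided by the slowest time (assumed definition)."""
    return sum(times) / (len(times) * max(times))

# Hypothetical run: 32 processors with times spread evenly from 1.0 to 2.0.
spread = [1.0 + i / 31 for i in range(32)]
print(f"{100 * workload(spread):.1f} %")

# Fastest and slowest times (seconds) of the 51-digit row of Figure 12.
print(f"{100 * workload([6552, 6742]):.1f} %")
```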


Figure 7. Time to compute polynomials per processor (axes: time in s vs processor identity)

Figure 8. Total execution time per processor (axes: time in s vs processor identity)


In order to improve the workload, we have to balance the computation by letting the processors work with the same number of polynomials.

If we know a priori how many polynomials, say G, are required to get the k lines, each processor may work with G/32 polynomials. Caron and Silverman [CaS 88] have experimentally studied the size of the factor base, the size of the sieve interval and the number of polynomials necessary for the factorization of an n-digit integer. Using this information, we can compute an approximate value of G (Figure 9).

Figure 9. Number of polynomials vs size of N


Hence the improved algorithm first computes the number of polynomials necessary for the factorization, and then each processor works with G/32 polynomials in order to get the desired total number of rows.


Using  this  new  strategy,  the  difference  between  the 
slowest  and  the  fastest  processors  is  drastically 
reduced:  each  step  of  the  algorithm  is  well  balanced 
on  each  processor. 

Figure  10  shows  the  factorization  time  for  larger 
values  of  N  in  the  range  35-60  decimal  digits. 


(For  N  with  about  60  digits,  it  takes  approximately 
31.5  hours  to  factor). 

The  percentage  of  the  time  during  which  the  fastest 
processor  is  idle  while  at  least  one  processor  is  still 
working  is  shown  on  the  following  figure  11. 

Figure 11. Difference of time (%) between the slowest and the fastest processors

The workload is better. The following table gives the execution times of the fastest and slowest processors and the workload for integers in the range 41-60 digits.

Digits   Fastest      Slowest      Workload
41       ?            ?            97.4 %
51       6,552 s      6,742 s      98.6 %
54       18,152 s     18,54? s     99.1 %
60       112,284 s    11?,613 s    9?.5 %

Figure 12. Execution times and workload


IV  -  Comparison  FPS  T40,  iPSC,  TNode 

Each node of the iPSC/2 hypercube [Arl 88] is an 80286 processor with an 80287 coprocessor. It can have 4 Mbytes of memory per node. Suppose we have an iPSC with 32 nodes.

To compare an implementation on the iPSC/2 with one on the FPS T40, note that the Transputers T414 and T800 are RISC microprocessors (7.5 Mips) and the 80286 is a CISC microprocessor (4 Mips). For our application, the 80286 is more powerful by a factor of approximately 2. But the most important feature is the memory space. As the generation of the polynomials consumes a large amount of dynamic memory, we estimate that multiplying the memory space by a factor of 4 will speed up the execution by a factor of 6. Both speedups together give a speedup of about 12 on the iPSC/2 compared to the FPS T40. This means that we can factor integers of the same size 12 times faster on the iPSC/2 than on the FPS T40, or that we can factor a 70-digit integer on the iPSC/2 within the time needed for a 60-digit integer on the FPS T40.


The TNode multiprocessor [ChT 89] is not a hypercube. But with its dynamic reconfiguration facilities, it is possible to use it as a tree, for the initialization phase for example. The TNode is based on the T800 Transputer, which has approximately the same features as the T414 Transputer as far as integer arithmetic is concerned. It is more powerful for floating-point arithmetic, but we do not need these facilities. So a TNode with 32 nodes and 1 Mbyte of memory per node will not be much faster than the FPS T40. But by using a MegaNode (128 nodes), it is clear that we can reach higher limits, because of the memory increase.
V - Conclusion

Silverman [Sil 88] noted that in certain cases, one can obtain better than linear speedup by partitioning an algorithm among machines, due to effects such as much greater real memory and having multiple data caches. Indeed, even on a tightly coupled MIMD massively parallel architecture, such speedups can be reached. This clearly shows the importance of the memory requirements for the MPQS algorithm. However, with 1 Mbyte of memory per processor, efficient results can be obtained.

Our results compare favorably with those obtained by Davis and Holdridge [DaH 88] on the NCube of 1,024 nodes, each with 512 Kbytes of local memory: the difference in time is bounded by a factor of 2.

We are implementing a completely parallelized version of the quadratic sieve algorithm on the FPS T40 target architecture. First experiments show that this implementation can be made very efficient. In particular, it is worth noting that the bulk of the computation resides in the sieve itself (§ 3). This part must be handled carefully and parallelized as efficiently as possible. For a 100-digit integer it seems to represent about 70% of the work. If the polynomials are generated by the processors in the sieve loop, a farm of processors could be used with high efficiency and very low redundant work, but some redundant data.

But  some  other  refinements  are  possible  (algorithmic 
and  mathematical  refinements).  They  are  currently 
under  implementation,  and  will  drastically  decrease 
the  total  execution  time. 
Abstract

This paper describes work done on the 1024-node NCUBE hypercube at the University of South Carolina in developing methods for efficient local solution of unconstrained minimization problems. The paper begins with a mathematical discussion of quasi-Newton methods for unconstrained optimization, and specifically Broyden's Method. Next it presents the parallel methods, and discusses the parallel implementation of the most common Broyden method. Finally it lists some numerical results to evaluate the performance of the parallel Broyden methods.

Introduction 

Many types of problems have benefited from the use of high speed parallel processors. The inherent parallelism in these problems has been exploited in order to achieve solutions with the same order of accuracy (or better), shorter times to solution, and/or solutions to problems that would have been intractable on a conventional computer. Examples of these types of problems include fluid dynamics, particle mechanics, and linear systems. One type of problem that has not been as well studied is nonlinear optimization (or its companion problem, nonlinear systems of equations).

Traditionally, quasi-Newton methods have been employed to find approximate solutions by iteration. These methods yield high order accuracy, and provide superlinear convergence in a neighborhood of the solution. The most promising quasi-Newton methods are dubbed “secant” methods since they follow a secant line through the previous iterate to select the next iterate.

Let $f : \Omega \subset \mathbb{R}^n \to \mathbb{R}$ be a convex function. Assume that there is some $x_* \in \Omega$ for which $f(x_*) \le f(x)$ for all $x \in \Omega$. The usual implementation of a secant method (with rank-one update) to find the minimizer $x_*$ of the smooth function $f$ is

$H_c s_c = -g_c$  (1)

$x_+ = x_c + s_c$  (2)

$H_+ = H_c + u v^T$  (3)

where $x_c$ is the current iterate, $g_c$ is the gradient of $f$ at the current iterate, $H_c$ is an approximation to the Hessian of $f$ at the current iterate, $x_+$ is the next iterate, and $H_+$ is the approximation to the Hessian at the next iterate, which will be used in the following step in lieu of recomputing an approximate Hessian of $f$ at $x_+$ (see [2]). The vectors $u$ and $v$ will be chosen so that $H_+$ will satisfy the secant equation:

$H_+ s_c = g_+ - g_c.$

We will follow the standard convention of denoting $y_c = g_+ - g_c$. Thus the secant equation can be rewritten as

$H_+ s_c = y_c$  (4)

Initial investigations by Byrd, Schnabel, and Schultz [1] into developing a parallel secant method for unconstrained optimization focus almost entirely on performing the linear algebra calculations in parallel and using simultaneous function evaluations. While their approach is appealing for use on a vector processor and the results are good, the inherent parallelism of the problem is left untapped. One would hope for a more effective method for use on a multiprocessor.

As an alternative to the BSS method for minimizing a multivariate nonlinear function, we consider fixing the current iterate and decomposing the descent direction into its axial components, then allowing each processor to compute its part of the next iterate and update. To develop this method we must examine the inverse Broyden method.


Broyden’s  Method 


Recall the quasi-Newton method presented above in (1)-(3). As yet, there has been no mention of how to select the vectors $u$ and $v$ used to perform the rank-one update to the matrix $H_c$. In 1965, Charles Broyden proposed the update

$H_+ = H_c + \frac{(y_c - H_c s_c)\, s_c^T}{s_c^T s_c}$

where $x_+$, $x_c$, $s_c$, and $y_c$ are as before. Since that time, the quasi-Newton methods derived from the use of similar updates have become known as Broyden methods. Note that the above update gives the least change in the affine model while remaining consistent with the secant equation (4). If instead of using the approximate Hessian, we use an approximate inverse of the Hessian, we can formulate the inverse Broyden method.
The inverse secant method takes the form:

$x_+ = x_c - A_c g_c$  (5)

$A_+ = A_c + u v^T$  (6)

where $x_c$, $g_c$, and $x_+$ are as before, but $A_c$ and $A_+$ are approximations to the inverse of the Hessian at $x_c$ and $x_+$, respectively, and $u$ and $v$ are chosen so that $A_+$ satisfies the inverse secant equation:

$x_+ - x_c = A_+ y_c.$  (7)

Note that this is exactly the secant method when $A_c = H_c^{-1}$ and $A_+ = H_+^{-1}$. If we take

$u = s_c - A_c y_c,$

and

$v = \frac{A_c^T s_c}{s_c^T A_c y_c}$  (GBU)

then we obtain the inverse Broyden method's update strategy

$A_+ = A_c + (s_c - A_c y_c)\, \frac{s_c^T A_c}{s_c^T A_c y_c}.$  (14)


For computational purposes, there are some simplifications in (14). First, compute

$s_c - A_c y_c = x_+ - (x_c - A_c g_c) - A_c g_+ = -A_c g_+.$

Furthermore, we can rewrite

$A_c y_c = A_c g_+ - A_c g_c = A_c g_+ + s_c.$

Thus, we arrive at the serial implementation of the inverse Broyden method.


ALGORITHM

The Serial Secant Method Using (GBU)
Assume that $x_k$, $A_k$, and $g_k$ are given.

(1) Compute $s = -A_k g_k$

(2) Compute $x_{k+1} = x_k + s$

(3) Evaluate $g_{k+1}$

(4) Compute $u = A_k g_{k+1}$

(5) Compute $A_{k+1} = A_k - \frac{u\, s^T A_k}{s^T (u + s)}$

(6) If not converged
    then increment k and go to Step (1).
    else set $x_* = x_{k+1}$.
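The serial iteration can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions, not the authors' C code: the update in step (5) is written as $A_{k+1} = A_k - u\,(s^T A_k)/(s^T(u+s))$, which follows from the (GBU) update together with $s - A_k y = -u$ and $A_k y = u + s$; the gradient norm is used as the only stopping test; and the function names and the 2 x 2 test problem are hypothetical.

```python
def mat_vec(a, v):
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]

def dot(u, v):
    return sum(u_i * v_i for u_i, v_i in zip(u, v))

def inverse_broyden(grad, x, a, tol=1e-10, max_iter=100):
    """Minimize by driving grad(x) to zero with the inverse Broyden update."""
    g = grad(x)
    for k in range(max_iter):
        if dot(g, g) ** 0.5 < tol:
            return x, k
        s = [-c for c in mat_vec(a, g)]                  # (1) s = -A g
        x = [x_i + s_i for x_i, s_i in zip(x, s)]        # (2) next iterate
        g = grad(x)                                      # (3) new gradient
        u = mat_vec(a, g)                                # (4) u = A g_new
        denom = dot(s, [u_i + s_i for u_i, s_i in zip(u, s)])   # s^T A y
        if denom == 0.0:
            break                                        # update undefined
        s_a = [dot(s, column) for column in zip(*a)]     # row vector s^T A
        a = [[a_ij - u_i * sa_j / denom for a_ij, sa_j in zip(row, s_a)]
             for row, u_i in zip(a, u)]                  # (5) rank-one update
    return x, max_iter

# Quadratic test: f(x) = 0.5 x^T Q x - b^T x with Q = [[4, 1], [1, 3]], b = [1, 2].
def grad(x):
    return [4 * x[0] + x[1] - 1, x[0] + 3 * x[1] - 2]

x_min, iterations = inverse_broyden(grad, [2.0, 2.0], [[0.25, 0.0], [0.0, 0.25]])
print(x_min, iterations)
```

The scaled identity is used as the starting matrix here because it is a reasonable first guess for the inverse Hessian of this particular quadratic. The exact minimizer is (1/11, 7/11).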


Parallel  Broyden’s  Method 

Again let $f : \Omega \subset \mathbb{R}^n \to \mathbb{R}$ be a convex function with a local minimum at $x_*$. We know that there is a neighborhood $\mathcal{N}(x_*) \subset \Omega$ of the minimizer such that quasi-Newton methods based on a secant approximation provide superlinear convergence on $\mathcal{N}(x_*)$. The idea is to develop a secant method which can simultaneously work in each of n orthogonal directions and still maintain superlinear convergence.

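If we let $\{b_j\}$ be an orthonormal basis for $\mathbb{R}^n$, we can write $x_+ = \sum_i \xi_i^+ b_i$, $x_c = \sum_i \xi_i^c b_i$, and $g_c = \sum_i \gamma_i^c b_i$. Premultiplying (5) by $b_i^T$ we get

$\xi_i^+ = \xi_i^c - \sum_j \gamma_j^c \, b_i^T a_j^c$  (8)

where $A_c = [a_1^c, \ldots, a_n^c]$, so that $a_j^c$ is the $j$-th column of $A_c$.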

The next question is how to obtain $A_+$, a nonsingular approximation to $[\nabla^2 f(x_+)]^{-1}$. We seek a rank-one update of the form (6). Since this update must satisfy the inverse secant equation (7), we substitute (6) into (7) to get

$x_+ - x_c = (A_c + u v^T)(g_+ - g_c) = A_c (g_+ - g_c) + v^T (g_+ - g_c)\, u$

and assuming that $v^T (g_+ - g_c) \ne 0$ we have

$u = \frac{(x_+ - x_c) - A_c (g_+ - g_c)}{v^T (g_+ - g_c)}$  (9)

Note that we are free to choose the update vector $v$. Then, using the $u$ from (9), we can satisfy (7) with the rank-one update in (6). We shall combine (9) and (6) to yield

$A_+ = A_c + \frac{\left[(x_+ - x_c) - A_c (g_+ - g_c)\right] v^T}{v^T (g_+ - g_c)}$  (10)

Introducing the usual notation, $s = x_+ - x_c$ and $y = g_+ - g_c$, and assuming $v^T (g_+ - g_c) = 1$, the matrix update becomes

$A_+ = A_c + (s - A_c y)\, v^T.$  (10')
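As a quick check, the normalized update can be multiplied by $y$ to confirm that it satisfies the inverse secant equation (7). The following worked step uses only the notation $s = x_+ - x_c$, $y = g_+ - g_c$ and the normalization $v^T y = 1$ stated above:

```latex
\begin{aligned}
A_+ y &= \bigl(A_c + (s - A_c y)\,v^T\bigr)\,y \\
      &= A_c y + (s - A_c y)\,(v^T y) \\
      &= A_c y + (s - A_c y) \;=\; s \;=\; x_+ - x_c .
\end{aligned}
```

Any $v$ with $v^T y = 1$ therefore yields a valid secant update; the conditions on $v$ only matter for the convergence rate.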

Gerber and Luk [4] discuss generalized secant methods for the solution of linear systems. By considering the linear system as the gradient equation, their results can be applied to minimization of quadratic functions. They prove that the method generated by (5) and update (9) has superlinear convergence in a neighborhood of the minimizer provided that $v$ is chosen to satisfy:

i) $v^T (g_+ - g_c) = 1$

ii) $v = A_c^T w$, for some $w$ such that

iii) $w^T (x_+ - x_c) \ne 0$

As an example, the “good” Broyden update vector

$v = \frac{A_c^T s}{s^T A_c y}$  (GBU)

satisfies the three conditions above.


Now, we wish to view $\beta_{ij} := b_i^T a_j$ as a scalar, so that (8) becomes

$\xi_i^+ = \xi_i^c - \sum_j \gamma_j^c \beta_{ij}^c.$

In order to implement this iteration on a parallel machine, we only need the $n$ elements $\beta_{i1}, \ldots, \beta_{in}$ and not the entire matrix $A_c$. Similarly, we can obtain an update strategy for $\beta_{ij}$ by

$\beta_{ij}^+ = \beta_{ij}^c + \nu_j \, b_i^T u$

where $v = [\nu_1, \nu_2, \ldots, \nu_n]$ and $u$ is defined by (9). Applying (GBU), let us examine the term more closely. First, note that $s = x_+ - x_c = \sum_i (\xi_i^+ - \xi_i^c)\, b_i$. Therefore,

[...]

Furthermore,

[...]

Putting these two formulas together yields a formula for the (GBU) $v$:

[...]

If we combine all of the above into one procedure, we produce the following algorithm.

ALGORITHM

The Parallel Secant Method Using (GBU)

(1) Set $\beta_{ij}^0 = b_i^T a_j^0$ where $A_0 = [a_1^0, \ldots, a_n^0]$
    for each $i, j = 1, \ldots, n$.

(2) Compute $\xi_1^0, \ldots, \xi_n^0$ scalars such that $x_0 = \sum_i \xi_i^0 b_i$,
    i.e. $\xi_i^0 = b_i^T x_0$.

(3) Do in parallel $p = 1, \ldots, n$

    (A) Set $k = 0$.
    (B) Evaluate $\gamma_p^k = [g(x_k)]_p$.
    (C) vm_concat($\gamma_1^k, \ldots, \gamma_n^k$).
    (D) Repeat
        (a) Set $\xi_p^{k+1} = \xi_p^k - \sum_j \gamma_j^k \beta_{pj}^k$.
        (b) vm_concat($\xi_1^{k+1}, \ldots, \xi_n^{k+1}$).
        (c) Evaluate $\gamma_p^{k+1} = [g(x_{k+1})]_p$.
        (d) vm_concat($\gamma_1^{k+1}, \ldots, \gamma_n^{k+1}$).
        (e) Set $\tau_p$ = [...]
        (f) vm_concat($\tau_1, \ldots, \tau_n$).
        (g) For $i = 1, \ldots, n$
            (i) [...]
            (ii) Set [...]
        (h) Set $k = k + 1$.
    (E) Until stopping criteria satisfied.
    (F) Return $x_{k+1}$ to host as $x_{final}$.

(4) Set $x_* = x_{final}$.

As a note, the vm_concat communication procedure allows data sharing over all of the nodes in the allocated subcube in logarithmic time. (See [3] for more information on the vm_concat procedure.) The choice of stopping criteria is a subject area unto itself. As an example, $\|x_{k+1} - x_k\| < \epsilon$ should be included in the stopping criteria, where $\epsilon > 0$ is the allowable error in the solution.

Numerical Results


The initial implementation of the above algorithms was done in the C programming language on the 1024-node NCUBE/Ten hypercube. In the parallel implementation, the standard basis for $\mathbb{R}^n$ was used, and several choices for $A_0$ were tried. The results from the initial implementation show that the algorithm is effective on convex functions. As an example, the tables below summarize the results for the quadratic

E  - 1)' 

»=1 ;=1  i=l 

using the twos vector as a starting guess, i.e. $x_0 = (2, 2, \ldots, 2)$. The secant algorithm was stopped if $\|g_+\| < 0.000001$, $\|x_+ - x_c\| < 0.000001$, or the number of iterations exceeded 500. The first column, d, indicates the dimension of the cube used. It is important to note that the timings listed herein include the time to load data to the nodes of the allocated cube.

For  the  first  table,  we  will  approximate  the  initial 
Hessian  by  the  n  x  n  identity  matrix.  This  approxima¬ 
tion  has  the  advantage  of  being  easily  computed  and 
positive  definite. 


Table  1  :  Quadratic,  n  =  32 


Table  1  :  Quadratic,  n  =  64 


Table  1  :  Quadratic,  n  =  128 


Table  1  :  Quadratic,  n  =  256 


Table 1 : Quadratic, n = 512


For Table 2, we will use a scaled version of the identity matrix as the initial approximation of the Hessian. This approximation has the advantage of being easy to compute and positive definite.


Table 2 : Quadratic, n = 32


Table 2 : Quadratic, n = 64

Using $A_0 = |f(x_0)|^{-1} I$, the scaled identity

d   Iterations   Time(sec)   $f(x_*)$
0   2            42.343      0.00e+00
1   2            21.611      0.00e+00
2   2            11.319      0.00e+00
3   2            6.214       0.00e+00
4   2            3.755       0.00e+00
5   2            2.639       0.00e+00
6   2            2.380       0.00e+00


Table 2 : Quadratic, n = 128

Using $A_0 = |f(x_0)|^{-1} I$, the scaled identity

d   Iterations   Time(sec)   $f(x_*)$
0   2            325.117     0.00e+00
1   2            163.438     0.00e+00
2   2            82.644      0.00e+00
3   2            42.833      0.00e+00
4   2            22.044      0.00e+00
5   2            12.076      0.00e+00
6   2            7.329       0.00e+00
7   2            5.348       0.00e+00


Table 2 : Quadratic, n = 256

Using $A_0 = |f(x_0)|^{-1} I$, the scaled identity

d   Iterations   Time(sec)   $f(x_*)$
2   2            646.008     1.57e-27
3   2            325.271     1.57e-27
4   2            164.579     1.57e-27
5   2            84.471      1.57e-27
6   2            44.629      1.57e-27
7   2            25.131      1.57e-27
8   2            16.257      1.57e-27


Table 2 : Quadratic, n = 512

Using $A_0 = |f(x_0)|^{-1} I$, the scaled identity

d   Iterations   Time(sec)   $f(x_*)$
4   2            1289.277    1.23e-29
5   2            650.342     1.23e-29
6   2            330.784     1.23e-29
7   2            171.689     1.23e-29
8   2            98.123      1.23e-29
9   2            62.651      1.23e-29


Finally, we will invert a finite difference approximation of the initial Hessian. This approximation is very close to the actual inverse of the initial Hessian, unless the matrix is poorly conditioned, and hence should produce more accurate results than the other two approximations. However, performing the finite differences and the matrix inversion is slow, requiring $O(n^3)$ operations, thereby increasing the start-up time on the host program. Due to the method of measuring time (only the time used on the nodes is counted), the increased start-up time is not reflected in the table.

Table 3 : Quadratic, n = 32

Using $A_0 = [\nabla^2 f(x_0)]^{-1}$, the inverted Hessian

d   Iterations   Time(sec)   $f(x_*)$
0   2            6.060       0.00e+00
1   2            3.233       0.00e+00
2   2            1.923       0.00e+00
3   2            1.270       0.00e+00
4   2            1.018       0.00e+00
5   2            1.039       0.00e+00


Table 3 : Quadratic, n = 64

Using $A_0 = [\nabla^2 f(x_0)]^{-1}$, the inverted Hessian

d   Iterations   Time(sec)   $f(x_*)$
0   2            44.473      0.00e+00
1   2            23.746      0.00e+00
2   2            11.941      0.00e+00
3   2            6.321       0.00e+00
4   2            4.594       0.00e+00
5   2            3.352       0.00e+00
6   2            2.673       0.00e+00


Table 3 : Quadratic, n = 128

Using $A_0 = [\nabla^2 f(x_0)]^{-1}$, the inverted Hessian

d   Iterations   Time(sec)   $f(x_*)$
0   2            332.325     0.00e+00
1   2            167.036     0.00e+00
2   2            84.383      0.00e+00
3   2            43.223      0.00e+00
4   2            22.542      0.00e+00
5   2            12.387      0.00e+00
6   2            7.739       0.00e+00
7   2            5.417       0.00e+00


Table 3 : Quadratic, n = 256

Using $A_0 = [\nabla^2 f(x_0)]^{-1}$, the inverted Hessian

d   Iterations   Time(sec)   $f(x_*)$
2   2            657.612     0.00e+00
3   2            330.778     0.00e+00
4   2            167.724     0.00e+00
5   2            85.927      0.00e+00
6   2            45.366      0.00e+00
7   2            25.513      0.00e+00
8   2            16.523      0.00e+00


Table 3 : Quadratic, n = 512

Using $A_0 = [\nabla^2 f(x_0)]^{-1}$, the inverted Hessian

d   Iterations   Time(sec)   $f(x_*)$
4   2            1315.607    0.00e+00
5   2            661.409     0.00e+00
6   2            336.369     0.00e+00
7   2            174.282     0.00e+00
8   2            94.106      0.00e+00
9   2            55.777      0.00e+00

As a second example, the parallel algorithm was tested on a rather complicated exponential function

Note that this function takes its minimum at the origin. The initial guess was the ones vector (1, 1, ..., 1), and again the three initial Hessian approximations were the identity matrix, the scaled identity matrix, and the inverse of the finite difference approximation to the Hessian at the initial guess. The results for n = 128 appear below in Table 4. The first table represents using the identity as the initial Hessian approximation.

Table 4: Exponential, n = 128

Using $A_0 = I$, the identity matrix

d    Iterations    Time(sec)    f(x*)
0    9             1513.322     3.83e-20
1    9             758.742      3.83e-20
2    9             380.269      3.83e-20
3    9             191.157      3.83e-20
4    9             96.840       3.83e-20
5    9             49.821       3.83e-20
6    9             26.654       3.83e-20
7    9             15.807       3.83e-20

The next table represents the same problem with
the same initial guess, but substituting the scaled
identity for the identity as the initial Hessian
approximation.

Table 4: Exponential, n = 128

Using $A_0 = |f(x_0)|^{-1} I$, the scaled identity

d    Iterations    Time(sec)    f(x*)
0    9             1511.218     6.92e-17
1    9             757.371      6.92e-17
2    9             379.557      6.92e-17
3    9             190.966      6.92e-17
4    9             97.919       6.92e-17
5    9             51.074       6.92e-17
6    9             28.883       6.92e-17
7    9             16.774       6.92e-17

The final table represents the same problem but
using the computed inverse of the finite difference
approximation to the Hessian at the initial guess.

Table 4: Exponential, n = 128

Using $A_0 = |\nabla^2 f(x_0)|^{-1}$, the inverted Hessian

d    Iterations    Time(sec)    f(x*)
0    8             1348.612     6.92e-17
1    8             675.449      6.92e-17
2    8             346.618      6.92e-17
3    8             168.983      6.92e-17
4    8             85.201       6.92e-17
5    8             43.370       6.92e-17
6    8             22.567       6.92e-17
7    8             12.407       6.92e-17

From the timing results, an efficiency rating can
be determined for each run of the algorithm. This
rating is a measure of how much parallelism can be
exploited by using more nodes on the problem. The
higher the efficiency rating, the more concurrent work
is being performed. Large amounts of communication
time lower the efficiency rating considerably. The
efficiency is taken to be the execution time of the
algorithm relative to the number of nodes used, i.e.,

$$\text{efficiency} = \frac{\text{time}_1}{p \cdot \text{time}_p}$$

where $\text{time}_1$ is the execution time of the algorithm on 1
node, and $\text{time}_p$ is the execution time of the algorithm
using p nodes. Note that the start-up cost incurred
by the host processor (reading $x_0$, evaluating $g_0$, and
approximating $A_0$) is not counted, but that the start-up
cost incurred by a node (receiving $x_0$, $g_0$ and the
appropriate section of $A_0$) is counted in the execution
time. The efficiencies presented in the table below are
surprisingly high when the gradient is expensive
to evaluate in relation to communication time. This
is the case for larger values of n. Due to memory
restrictions on the nodes (roughly 400 kilobytes of user
space is available), running the n = 256 quadratic was
not possible on a cube of dimension 0 or 1. Rather than
extrapolate the timings for the n = 256 problem on 1
or 2 nodes, the efficiency ratings for that problem have
been omitted. Similarly, there are no efficiency results
for the n = 512 problem. The rest of the efficiency
ratings for the quadratic are listed in Table 5 below.
Missing values are indicated by two asterisks in the
field. (These are cases when the number of nodes used
would exceed the dimension of the problem, forcing idle
time or unnecessary replication of work.)

Table 5: Efficiency, Quadratic

Using $A_0 = |\nabla^2 f(x_0)|^{-1}$, the inverted Hessian

d    n = 32    n = 64    n = 128
0    1.0000    1.0000    1.0000
1    0.9370    0.9364    0.9948
2    0.7878    0.9311    0.9846
3    0.5964    0.8794    0.9611
4    0.3719    0.6049    0.9214
5    0.1821    0.4145    0.8383
6    **        0.2599    0.6709
7    **        **        0.4792


The efficiencies for the quadratic function above
with n = 128 are summarized in the three
figures below. The similarities of the efficiencies
between any two of the three cases preclude their
combined presentation in one plot.

Figure 1: Quadratic, n = 128, $A_0 = (\nabla^2 f(x_0))^{-1}$

Figure 2: Quadratic, n = 128, $A_0 = I$

Figure 3: Quadratic, n = 128, $A_0 = |f(x_0)|^{-1} I$

(Horizontal axis in each figure: dimension of cube.)


Finally, for comparative purposes, the efficiencies
for the exponential function with n = 128 are shown in
the next three figures (again, one for each of the three
methods for choosing the initial Hessian approximation).
Note that these three plots appear very similar
to the previous plots. In fact, the plots from all of
the test cases used to test the algorithm take the same
form. The small dip in Figure 4 can be attributed to
the inaccuracy of taking measurements while multiple
jobs were running on the NCUBE.

The  high  efficiencies  correspond  to  ratings  of  be¬ 
tween  80.0  and  00.0  kiloflops  on  each  of  the  128  nodes. 
This  agrees  with  the  findings  of  Gustafson,  Montry, 
and  Benner  [5]  that  sustained  performance  is  between 
70  and  130  megaflops  for  the  NCUBE/Ten  (68.4  to 
127.0  kiloflops  on  a  single  node,  double  precision). 


Figure 4: Exponential, n = 128, $A_0 = (\nabla^2 f(x_0))^{-1}$

Figure 5: Exponential, n = 128, $A_0 = I$

Figure 6: Exponential, n = 128, $A_0 = |f(x_0)|^{-1} I$


There were several other test functions used
in evaluating the performance of the parallel secant
method, but those that did not diverge did not provide
as clear an illustration of the performance
characteristics of the method. Hence, they have been omitted.
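
As a quick check on the efficiency ratings reported above, the definition efficiency = time_1 / (p * time_p) can be applied directly to the published timings. The sketch below does this for the n = 128 quadratic, assuming a cube of dimension d uses p = 2^d nodes; the function names are illustrative and not taken from the paper.

```python
# Timings (seconds) for the n = 128 quadratic, inverted-Hessian start,
# copied from Table 3; the list index is the cube dimension d.
times_n128 = [332.325, 167.036, 84.383, 43.223, 22.542, 12.387, 7.739, 5.417]


def efficiency(time_1, time_p, p):
    """Efficiency as defined in the text: time_1 / (p * time_p)."""
    return time_1 / (p * time_p)


def efficiencies(times):
    """Efficiency for every cube dimension d, assuming p = 2**d nodes."""
    return [efficiency(times[0], t, 2 ** d) for d, t in enumerate(times)]


for d, e in enumerate(efficiencies(times_n128)):
    print(f"d = {d}  nodes = {2 ** d:3d}  efficiency = {e:.4f}")
```

The computed values agree with the n = 128 column of Table 5 to within about one unit in the fourth decimal place, which suggests the published ratings were truncated rather than rounded.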

Conclusions 

These results suggest that the method is efficient
in the minimization of a multivariate nonlinear
function, and that while the number of iterations required
for a solution, and the time required to obtain that
solution, are dependent upon the initial approximation
for the inverse of the Hessian, the extra accuracy and
speed gained by using the finite difference approach
do not justify the additional start-up penalty
incurred. The method works well for convex functions,
but is subject to the usual limitations of an inverse
secant method: bad initial approximations or noisy
functions can cause the iterates to traverse infeasible
regions, or to diverge. To force convergence in
these cases, trust regions, line searches, or backtracking
techniques must be added. By adding a "corrective"
technique, the parallel secant method will provide
a very good alternative to other minimization
methods, particularly when the problem size increases,
and function evaluation becomes more costly.

Abstract

This  paper  describes  the  implementation  of  a 
parallel  Levenberg-Marquardt  algorithm  on  an  iPSC/2. 
The  Levenberg-Marquardt  algorithm  is  a  standard 
technique  for  non-linear  least-squares  optimization.  For
a  problem  with  D  data  points  and  P  parameters  to  be 
estimated,  each  iteration  requires  that  the  objective 
function  and  its  P  partials  be  evaluated  at  all  D  data 
points,  using  the  current  parameter  estimates.  Each 
iteration  also  requires  the  solution  of  a  PxP  linear 
system  to  obtain  the  next  set  of  parameter  estimates.  A 
simple  data-parallel  decomposition  is  used  where  the 
data  is  evenly  distributed  across  the  nodes  to  parallelize 
the  evaluations  of  the  objective  function  and  its  partial 
derivatives.  The  performance  of  the  method  is 
characterized  versus  the  number  of  nodes,  the  number  of 
data  points,  and  the  number  of  parameters  in  the 
objective  function.  Further  enhancements  are  also 
discussed. 

Introduction 

Many  problems  can  be  cast  as  the  search  for  a  set 
of  parameters  that  minimize  (or  maximize)  some 
function.  Such  a  problem  is  known  as  a  minimization 
or  optimization  problem.  The  function  to  be  minimized 
is  known  as  the  objective  function.  Classes  of 
algorithms  exist  for  cases  where  the  objective  function 
is  linear  vs.  non-linear,  univariate  vs.  multivariate,  and 
where  its  derivatives  with  respect  to  the  adjustable 
parameters  are  known  vs.  unknown. 

The Levenberg-Marquardt (LM) algorithm [1,2] is a
standard technique for non-linear least-squares,
multivariate optimization when the partial derivatives of
the objective function are known and are not too
inconvenient to compute. For a problem with D data
points and P parameters to be estimated, each iteration
requires that the objective function and its P partials be
evaluated at all D data points using the current
parameter estimates. Each iteration also requires the
solution of a PxP linear system to obtain the next set
of parameter estimates. P is almost always much less
than D.

This paper describes the implementation of a
parallel LM algorithm on an iPSC/2 SX (Weitek
FPUs). The data is evenly distributed across the nodes
to parallelize the evaluations of the objective function
and its partial derivatives. Since the computation of the
objective function and its partials at one data point is
independent of the computations at the other data
points, this phase of the problem is perfectly parallel.
Currently, the linear system solution is carried out on
one node.

Levenberg-Marquardt  Technique 

In  this  section  we  will  discuss  the  LM  algorithm  in 
enough  detail  to  see  how  it  is  parallelized.  A  full 
derivation  can  be  found  in  [1],  [2],  or  almost  any  book 
on  non-linear  optimization.  The  presentation  below 
follows  [2]. 

The goal of the LM algorithm is to minimize the
objective function:

$$\chi^2(p) = \sum_{i=1}^{N} \left( y_i - y(x_i, p) \right)^2 \qquad (1)$$

by iteratively adjusting the vector of parameters, p. The
$x_i$ are the independent variables of the data, the $y_i$ are
the measured dependent values, and the function $y(x_i, p)$
predicts $y_i$ given $x_i$ and the current parameter estimates.

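The components of the gradient vector and second
derivative matrix of $\chi^2$ are:

$$\frac{\partial \chi^2}{\partial p_k} = -2 \sum_{i=1}^{N} \left( y_i - y(x_i, p) \right) \frac{\partial y(x_i, p)}{\partial p_k} \qquad (2)$$

$$\frac{\partial^2 \chi^2}{\partial p_k \partial p_l} = 2 \sum_{i=1}^{N} \left[ \frac{\partial y(x_i, p)}{\partial p_k} \frac{\partial y(x_i, p)}{\partial p_l} - \left( y_i - y(x_i, p) \right) \frac{\partial^2 y(x_i, p)}{\partial p_k \partial p_l} \right] \qquad (3)$$

We will remove the factors of two by defining the
components of the vector b and matrix A as

$$b_k = -\frac{1}{2} \frac{\partial \chi^2}{\partial p_k} \qquad (4)$$

$$A_{kl} = \frac{1}{2} \frac{\partial^2 \chi^2}{\partial p_k \partial p_l} \qquad (5)$$

If $\chi^2$ can be accurately approximated by a quadratic
surface, then the correction $\delta p$ (which when added to the
current parameter estimates p gives the parameters $p_{min}$
that minimize $\chi^2$) can be found in a single step by
solving the linear system:

$$A \, \delta p = b \qquad (6)$$

Note that A is a function of both the first and second
derivatives of the model $y(x_i, p)$. The second derivatives
will be destabilizing when the quadratic is not a good
approximation. We will therefore neglect these terms
and redefine A as:

$$A_{kl} = \sum_{i=1}^{N} \frac{\partial y(x_i, p)}{\partial p_k} \frac{\partial y(x_i, p)}{\partial p_l} \qquad (7)$$

This change improves stability and lessens the
computational complexity without seriously degrading
performance of the method.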

If the quadratic approximation is a bad assumption,
the step of (6) will probably cause $\chi^2$ to increase rather
than decrease. In this case, about the best we can do is
take a step down the gradient:

$$\delta p = t \, b, \qquad (8)$$

where t is a constant that sets the size of the step. Even
if the quadratic model is only a fair approximation, A
provides some information about the size of t.
Marquardt defined a new matrix A':

$$A' = A + \lambda I \qquad (9)$$

When $\lambda$ is small, our update is essentially that of (6).
When $\lambda$ is large, A' becomes diagonally dominant, and
our step approaches an infinitesimal step down the
gradient. As long as our steps are succeeding, we can
assume the quadratic approximation to $\chi^2$ is accurate,
and can decrease $\lambda$ to achieve faster convergence. If a
proposed step fails, we increase $\lambda$ and try again. This is
expressed below in Algorithm 1.

Algorithm 1

Serial Levenberg-Marquardt Technique

0  input initial p, data x, y, set λ = 0.001

1  compute χ²(p), b, A

2  if ||b|| < tol or iteration limit reached,
   done

3  solve (A + λI) δp = b for δp

4  compute χ²(p + δp), b_tmp, A_tmp

5  if χ²(p + δp) ≥ χ²(p)
       λ = λ * 10
   else
       λ = λ / 10, p = p + δp,
       b = b_tmp, A = A_tmp

6  go to 2
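
The steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names (`solve`, `chi2_b_A`, `levenberg_marquardt`), the dense Gaussian-elimination solver, and the two-parameter test model `exp_model` are all assumptions made for the example. A model is any function returning the predicted value and its list of partial derivatives at one data point.

```python
import math


def solve(M, v):
    """Solve M z = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (aug[r][n] - s) / aug[r][r]
    return z


def chi2_b_A(model, xs, ys, p):
    """Return chi^2, b, and the first-derivative-only matrix A."""
    P = len(p)
    chi2, b = 0.0, [0.0] * P
    A = [[0.0] * P for _ in range(P)]
    for x, y in zip(xs, ys):
        f, grad = model(x, p)
        r = y - f
        chi2 += r * r
        for k in range(P):
            b[k] += r * grad[k]
            for l in range(P):
                A[k][l] += grad[k] * grad[l]
    return chi2, b, A


def levenberg_marquardt(model, xs, ys, p, tol=1e-10, max_iter=200):
    P = len(p)
    lam = 0.001                                          # step 0
    chi2, b, A = chi2_b_A(model, xs, ys, p)              # step 1
    for _ in range(max_iter):
        if math.sqrt(sum(v * v for v in b)) < tol:       # step 2
            break
        damped = [[A[k][l] + (lam if k == l else 0.0) for l in range(P)]
                  for k in range(P)]
        dp = solve(damped, b)                            # step 3
        trial = [pk + dk for pk, dk in zip(p, dp)]
        chi2_t, b_t, A_t = chi2_b_A(model, xs, ys, trial)  # step 4
        if chi2_t >= chi2:                               # step 5: reject
            lam *= 10.0
        else:                                            # step 5: accept
            lam /= 10.0
            p, chi2, b, A = trial, chi2_t, b_t, A_t
    return p, chi2


def exp_model(x, p):
    """Hypothetical test model y = a * exp(c * x) and its two partials."""
    a, c = p
    e = math.exp(c * x)
    return a * e, [e, a * x * e]


xs = [0.1 * i for i in range(20)]
ys = [2.0 * math.exp(-1.5 * x) for x in xs]   # noise-free synthetic data
p_fit, chi2_fit = levenberg_marquardt(exp_model, xs, ys, [1.0, -1.0])
print(p_fit, chi2_fit)
```

A rejected step leaves p, b, and A untouched and only increases λ, so the next pass through step 3 re-solves the same system with heavier damping, which is the behaviour the "go to 2" in the listing implies.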


Parallel  Decomposition 

The LM algorithm is quite easy to parallelize using
a data-parallel decomposition. At the beginning of the
process, the x and y data is evenly distributed among the
nodes. At the start of each iteration, the current
parameter estimates are broadcast to all the nodes. Each
node then computes $\chi^2$, b, and A for its portion of the
data. These are summed and node 0 receives the total $\chi^2$,
b, and A. It solves the linear system to determine the
new vector of parameter estimates. This is expressed
below in Algorithm 2.

Algorithm 2

Parallel Levenberg-Marquardt Technique

0  distribute x, y data evenly among nodes,
   broadcast initial p to all nodes,
   set λ = 0.001

1  compute χ²(p), b, A in parallel

2  if ||b|| < tol or iteration limit reached,
   done

3  solve (A + λI) δp = b for δp

4  broadcast p + δp to all nodes

5  compute χ²(p + δp), b_tmp, A_tmp in parallel

6  if χ²(p + δp) ≥ χ²(p)
       λ = λ * 10
   else
       λ = λ / 10
       p = p + δp
       b = b_tmp, A = A_tmp

7  go to 2
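
The data-parallel step in Algorithm 2 works because χ², b, and A are all plain sums over data points, so per-node partial sums can simply be added together. The sketch below imitates this in a single process, with no real message passing: `split_evenly` stands in for the initial data distribution and `global_sums` for the global summation. These names and the straight-line test model are hypothetical and chosen only for illustration.

```python
def partial_sums(model, xs, ys, p):
    """Work done by one node: chi^2, b and A over its share of the data."""
    P = len(p)
    chi2, b = 0.0, [0.0] * P
    A = [[0.0] * P for _ in range(P)]
    for x, y in zip(xs, ys):
        f, grad = model(x, p)
        r = y - f
        chi2 += r * r
        for k in range(P):
            b[k] += r * grad[k]
            for l in range(P):
                A[k][l] += grad[k] * grad[l]
    return chi2, b, A


def split_evenly(seq, nodes):
    """Contiguous, nearly equal shares of seq, one per node."""
    q, r = divmod(len(seq), nodes)
    out, start = [], 0
    for i in range(nodes):
        size = q + (1 if i < r else 0)
        out.append(seq[start:start + size])
        start += size
    return out


def global_sums(model, xs, ys, p, nodes):
    """Stand-in for the global summation: add every node's partial result."""
    P = len(p)
    chi2, b = 0.0, [0.0] * P
    A = [[0.0] * P for _ in range(P)]
    shares = zip(split_evenly(xs, nodes), split_evenly(ys, nodes))
    for xs_k, ys_k in shares:
        c_k, b_k, A_k = partial_sums(model, xs_k, ys_k, p)
        chi2 += c_k
        for k in range(P):
            b[k] += b_k[k]
            for l in range(P):
                A[k][l] += A_k[k][l]
    return chi2, b, A


def line_model(x, p):
    """Hypothetical model y = p[0] + p[1] * x with its partials."""
    return p[0] + p[1] * x, [1.0, x]


xs = [float(i) for i in range(10)]
ys = [3.0 + 0.5 * x for x in xs]
print(global_sums(line_model, xs, ys, [0.0, 0.0], nodes=4))
print(partial_sums(line_model, xs, ys, [0.0, 0.0]))  # single node, same totals
```

Each node contributes 1 + P + P² numbers to the summation regardless of how many data points it holds, so the message length depends only on the number of parameters P.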

It  would  also  be  possible  to  parallelize  the  linear 
system  solution  in  step  3.  The  advantages  and 
disadvantages  of  this  are  discussed  in  the  conclusions 
section. 

Once the data has been distributed, the only
communication that occurs is the broadcast of the new
parameter estimates and the summation of $\chi^2$, b, and
A. This communication does not increase as the
number of data points increases. The communication
does increase as the log of the number of nodes and the
square of the number of parameters.

Performance 

Several factors make it difficult to characterize the
performance of the LM algorithm. First, it is an
iterative algorithm, so the possibility exists that as
more data points are added, enough additional
information is obtained to reduce the number of
iterations needed to converge to a set of parameter
estimates. For this reason we report per-iteration times.
The number of iterations will not be affected by the
number of nodes used.

Another problem is that the measures of speedup
and efficiency are influenced by the complexity of the
objective function calculations relative to the linear
system solution. For example, consider the case where
the evaluations of the objective and its partials take very
little time. The computation time will be dominated by
that of the linear system solution, and will show no
dependence upon the number of data points. Further,
performance would be expected to degrade as additional
nodes are added, due to the increased communication
costs. The opposite case, where the evaluation of the
objective function and its partials is quite lengthy,
would exhibit almost perfectly linear speedup. Neither
of these results would accurately predict the performance
for objective functions of moderate complexity. Our
objective function should also allow us to characterize
performance as the number of parameters increases. To
accommodate both these requirements, we will use an
objective function that models the input data as the sum
of K Gaussians. By increasing K, we can increase the
computational complexity of the objective function.
Each Gaussian is characterized by three parameters: its
location, its amplitude $a_k$, and its spread $s_k$. The
objective function is then

$$\chi^2 = \sum_{i=1}^{D} \left( y_i - y(x_i, p) \right)^2 \qquad (10)$$

where $x_i$ is the independent variable(s) of the i'th data
point, $y_i$ is the dependent value, and $y(x_i, p)$ is the
function value predicted from the current set of
parameters, p:

$$y(x_i, p) = \sum_{k=1}^{K} \left( \text{Gaussian } k \text{ evaluated at } x_i \right)$$

K is the number of Gaussians, thus the number of
parameters in the model is P = 3K. An example of the
input data and the underlying Gaussians is shown below
in figure 1 for the case K=3.
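
To make the P = 3K parameter count concrete, here is a small sketch of a sum-of-Gaussians model together with its partial derivatives. The exact functional form used in the paper is not recoverable from the text, so the form a_k * exp(-(x - m_k)^2 / (2 s_k^2)), the symbol m_k for the location, and the parameter ordering are all assumptions made for this example.

```python
import math


def gaussian_sum(x, p):
    """Sum of K Gaussians and its 3K partial derivatives.

    p is laid out as [m_1, a_1, s_1, ..., m_K, a_K, s_K] (assumed ordering),
    and each term is assumed to be a_k * exp(-(x - m_k)**2 / (2 * s_k**2)).
    """
    value, grad = 0.0, []
    for k in range(0, len(p), 3):
        m, a, s = p[k], p[k + 1], p[k + 2]
        z = (x - m) / s
        e = math.exp(-0.5 * z * z)
        value += a * e
        grad += [a * e * z / s,       # partial with respect to location m_k
                 e,                   # partial with respect to amplitude a_k
                 a * e * z * z / s]   # partial with respect to spread s_k
    return value, grad


p = [-1.0, 2.0, 0.5, 0.0, 1.0, 1.0, 2.0, 3.0, 0.7]   # K = 3, so P = 9
value, grad = gaussian_sum(0.3, p)
print(len(p), value)
```

The cost of one evaluation grows linearly in K, while the matrix A accumulated from the partials grows as (3K)², which is what lets K act as a single knob for both the per-point work and the linear system size.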

Figure 2a shows the time for a single iteration,
including the linear system solution, versus the number
of nodes and number of data points for the case of 4
Gaussians (12 parameters). Figure 2b shows the time
for a single iteration versus the number of parameters
and number of nodes for the case of 5000 data points.
These times do not include the time to read the data into
the nodes. The decision not to include I/O times was
made because I/O times are highly dependent upon the
presence or absence of the iPSC/2 Concurrent I/O
hardware.

Figure 2a: Time / Iteration vs. Number of Nodes and
Number of Data Points, 12 Parameters

Figure 2b: Time / Iteration vs. Number of Nodes and
Number of Parameters, 5000 Data Points


Conclusions 

Figure 2a shows that for small data sets, the
increase in communication as the number of nodes
increases is not offset by the time saved by
evaluating the objective function and its partials in
parallel. In fact, we see a slowdown as the number of
nodes increases. As can be expected, this problem is
reduced when the data sets become larger, as shown in
Figure 2a. The same problem is seen for small numbers
of parameters, as shown in Figure 2b.

This  problem  has  two  causes.  Most  of  the  nodes 
are  sitting  idle  while  one  node  performs  the  linear 
system  solution.  This  idle  time,  combined  with  the 
increased  communication  mentioned  above,  leads  to  the 
poor  speedups  noted  as  the  number  of  nodes  increases.

The efficiency of the algorithm could be increased if
we could speed the linear system solution, thus
eliminating the time where all nodes but one are idle.
The obvious solution is to parallelize the linear system
solution, which is the next step we will try. However,
its benefits are doubtful. For small numbers of
parameters more overhead will be incurred than would
be compensated for by parallel execution. In other
words, it will merely aggravate the current problem.
This will be alleviated as the number of parameters
increases, but reasonable efficiencies will probably not
be reached until there are about 10 times as many
parameters as nodes [3]. However, for problems of that
size it is time to see if an algorithm other than LM can
be used. LM is an excellent general purpose algorithm,
but for such large numbers of parameters it is better to
find an algorithm that exploits special,
problem-dependent characteristics of $\chi^2$, b, or A.

References 

[1]  Scales, L.E. (1985) Introduction to Non-Linear
Optimization, Springer-Verlag, New York.

[2]  Press, W.H., Flannery, B.P., Teukolsky, S.A., and
Vetterling, W.T. (1988) Numerical Recipes in C,
Cambridge University Press, Cambridge.

[3]  Juszczak, J.W. and van de Geijn, R.A. (1989) An
Experiment in Coding Portable Parallel Matrix
Algorithms. In Proceedings of the Fourth Conf. on
Hypercubes, Concurrent Computers, and
Applications (HCCA4), Monterey, CA, April,
675-680.




Parallelizing  Multiple  Linear  Regression 
for  Speed  and  Redundancy:  An  Empirical  Study


Mingxian  Xu 
John  J.  Miller 
Edward  J.  Wegman 

Center  for  Computational  Statistics 
George  Mason  University 
Fairfax,  VA  22030 


ABSTRACT 

The  purpose  of  this  paper  is  to  present  a 
parallel  implementation  of  multiple  linear 
regression.  We  discuss  the  multiple  linear 
regression  model.  Traditionally  parallelism  has 
been  used  for  either  speed-up  or  redundancy 
(hence  reliability).  With  stochastic  data,  by 
clever  parsing  and  algorithm  development,  it  is 
possible  to  achieve  both  speed  and  reliability 
enhancement.  We  demonstrate  this  with 
multiple  linear  regression.  Other  examples 
include  kernel  estimation  and  bootstrapping. 

1.  Introduction 

Contemporary statistical computations
often focus on the analysis of massive data sets
with complex algorithms. Consequently, efforts
to speed up the calculations are extremely
important even with the impressive
computational power available today. Parallel
computation techniques are an important
technology that may be used to achieve speed-up
and, indeed, are likely to become even more
significant as the physical limits of conventional
serial architectures are reached. Historically, in
the 1960s and early 1970s, parallelism was also
used in the design of both hardware and
software to enhance the reliability of systems
through redundancy. In such a design,
components (either hardware or software) run in
parallel performing the same task. The parallel
processors each process the same data, with a
voting procedure used to determine the reported
outcome of the computation. The object of the
redundancy in this case is fault tolerance. Of
course, this type of parallelism leads to no
inherent speed-up in the computations.

One may use parallelism in achieving
speed-up by sending different data to the
different processors. This can result in
substantial speed-up, depending on
communication overhead and the details of the
implementation of parallelism. However, in this
mode of operation there is no mechanism for
achieving fault detection. For example, the
decomposition of an integral and assignment of
portions of that integral to processors in a
numerical quadrature algorithm is an illustration
of this sort of parallelism. It would usually be
impossible to know whether one processor
returned an incorrect value for its portion of the
integral. An interesting question then is
whether or not there are situations where we
can use parallelism for speed-up and still
maintain some of the properties of redundancy
for our reliability checks. In fact, the thesis of
this paper is that this is possible in some
situations, as will be described below.

In a setting in which data may be
assumed to be generated from some
probabilistically homogeneous structure, we are
suggesting the use of statistical hypothesis tests
in place of voting procedures to compare results
from different nodes. We assign data to nodes
by parsing in an appropriate manner. As a first
step, we assume that the data may be parsed
into random samples. Since we began with
stochastically homogeneous data and parsed it
into random samples, the only variation in the
output that we should expect to see from the
different nodes is stochastic variation. Hence,
we can use statistical tests to check the results
for deviations from homogeneity. These tests
yield a stochastic measure of redundancy for our
parallel implementation. We can, thus, use the
tests for fault detection in either the node
hardware or software.

To  illustrate  these  ideas,  we  have 
selected  multiple  linear  regression  as  an 
application.  We  parallelize  the  computations  for 
multiple  regression  and  then  use  the  results  from 
each  node  as  a  part  of  a  homogeneity  check. 
We  have  selected  multiple  regression  because  the 
procedure  is  well  understood,  the  computations 
are  straightforward,  and  the  statistical  tests  of 
homogeneity  are  easily  developed. 
Subsequently,  we  develop  other  applications 
such  as  kernel  density  estimation  and 
bootstrapping. 

Our implementation of the parallel techniques
described above uses an Intel iPSC/2 (referred
to in this paper as the hypercube). Our hypercube
is configured with 16 nodes, each of which has a
32-bit 80386 CPU, an 80387 math co-processor, a
direct routing module for communication via message
passing and an additional vector pipeline
co-processor. In addition to the 16 nodes, there is
also a host node (referred to by Intel as the
system resource manager), which "directs" the
activity of the other nodes. The hypercube
system has a distributed, message-passing
architecture. Data passes through the host to
the nodes and the results are gathered from the
nodes back to the host. Given the message-passing
nature of the architecture,
communication overhead typically plays a
significant part in the overall effectiveness of
any algorithm. In general, computational
problems which require comparatively little
internode communication are the most effective
ones on message-passing architectures.
Bootstrapping and kernel smoothing operations
are examples of computationally intensive tasks
which fall into this category. While multiple
linear regression is comparatively
communications intensive, it does admit a very
effective parallel implementation and, moreover,
illustrates our point so elegantly that we felt it was
quite worth developing. It also allows us to
investigate the effect of varying the
communication packet size on the potential
speed-up.

2.  The  Multiple  Linear  Regression  Model 

We often have the need to study a
system in which the changes in several variables
may affect the dependent variable. We may
know or be willing to assume that the model is
expressed as a linear model or we may use a
linear model as an approximation to some
unknown, more complex model. In either case,
least squares estimation yields a computational
technique generally known as regression. We
distribute the computations necessary for
multiple linear regression over several node
processors. We then use statistical tests for
homogeneity as a redundancy check for
hardware and software faults. The tests used in
this discussion depend on the assumption of
normally distributed residuals for their complete
validity although, of course, their nonparametric
analogues may also be used. Because we use
these normal tests as descriptive statistics to
indicate severe deviations from homogeneity, we
are not extremely concerned whether the
assumption of normality is met exactly or not.

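The mathematical model for multiple
linear regression can be expressed as follows:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i, \qquad i = 1, 2, \ldots, n,$$

or in matrix formulation:

$$y = \mathbf{1}\beta_0 + X\boldsymbol{\beta}_1 + \epsilon$$

where $y$ is an (n x 1) vector of observations, $\mathbf{1}$
is an (n x 1) vector of ones, $\beta_0$ is an unknown
parameter, $X$ is an (n x p) matrix of
nonstochastic variables, $\boldsymbol{\beta}_1$ is a (p x 1) vector
of unknown parameters, and $\epsilon$ is an (n x 1)
vector of random errors. The traditional
assumptions are $E(\epsilon) = 0$ and $Cov(\epsilon) = \sigma^2 I$.
Thus $E(y) = \mathbf{1}\beta_0 + X\boldsymbol{\beta}_1$ and $Cov(y) = \sigma^2 I$. It can be
shown that the least squares estimates of $\beta_0$ and
$\boldsymbol{\beta}_1$, $\hat{\beta}_0$ and $\hat{\boldsymbol{\beta}}_1$, may be obtained as follows:

$$\hat{\boldsymbol{\beta}}_1 = (\tilde{X}'\tilde{X})^{-1}\tilde{X}'\tilde{y}, \qquad \hat{\beta}_0 = \bar{y} - \bar{x}'\hat{\boldsymbol{\beta}}_1, \qquad (1)$$

where $\bar{x} = X'\mathbf{1}/n$ is the vector of column means,
$\tilde{X} = X - \mathbf{1}\bar{x}'$ is the centered X matrix so that
$\tilde{x}_{ij} = x_{ij} - \bar{x}_j$, $\bar{y} = y'\mathbf{1}/n$, and $\tilde{y} = y - \mathbf{1}\bar{y}$.
We will assume that in the model we work with,
we have a nonsingular $\tilde{X}'\tilde{X}$ matrix in each place
where we require $(\tilde{X}'\tilde{X})^{-1}$. We may also obtain
a Sum of Squared Errors, $SSE = \tilde{y}'\tilde{y} - \tilde{y}'\tilde{X}\hat{\boldsymbol{\beta}}_1$.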

To implement multiple linear regression
in a parallel fashion, we partition the whole set
of the observations and variables into m subsets
of close to equal size. We then send each of
these subsets to one of the m nodes. We denote
data sent to, computed at, or received from
node k by adding a subscript (k) to the item.
Thus we send to node k: $y_{(k)}$ and $X_{(k)}$ of $n_k$
rows each. We compute at node k:

$\bar{x}_{(k)}$, $\bar{y}_{(k)}$, $(\tilde{X}'_{(k)}\tilde{X}_{(k)})$, $(\tilde{X}'_{(k)}\tilde{y}_{(k)})$, $(\tilde{y}'_{(k)}\tilde{y}_{(k)})$,

and $SSE_{(k)}$. We note two things at this point:
1) We center at each node. (We use a one pass
recursive centering algorithm for speed and
accuracy.) 2) We do not compute the slopes
and intercepts at each node, although we could
if we wished. We are not going to use the node
estimates. We merely wish to use the
information returned from the nodes to i)
compute the least squares estimates for all the
data and ii) assess homogeneity of the results
for fault checking.

In order to proceed with our
homogeneity checks, we define three potential
models for our data. Model 0 is the nominal
model. For each node partition of the data, we
assume that the slope vector $\boldsymbol{\beta}_1$ and the
intercept $\beta_0$ are the same. For Model 1, we
assume that the node partitions have the same
slope vector, but different intercepts. For Model
2, we assume that the node partitions have both
different slope vectors and different intercepts.
In matrix terms, the three models are given by:

Model 0: $y_{(k)} = \mathbf{1}\beta_0 + X_{(k)}\boldsymbol{\beta}_1 + \epsilon_{(k)}$.

Model 1: $y_{(k)} = \mathbf{1}\beta_{0(k)} + X_{(k)}\boldsymbol{\beta}_1 + \epsilon_{(k)}$.

Model 2: $y_{(k)} = \mathbf{1}\beta_{0(k)} + X_{(k)}\boldsymbol{\beta}_{1(k)} + \epsilon_{(k)}$.

After  aggregating  at  the  host  the 
information  computed  at  the  nodes,  we  proceed 
to  compute  a  Sum  of  Squared  Errors  for  each  of 
the  three  models  as  follows; 

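Model 2: $SSE_2 = \sum_{k=1}^{m} SSE_{(k)}$.

Model 1:

$(y'y)_{(1)} = \sum_{k=1}^{m} (y_{(k)}'y_{(k)})$,

$(X'y)_{(1)} = \sum_{k=1}^{m} (X_{(k)}'y_{(k)})$,

$(X'X)_{(1)} = \sum_{k=1}^{m} (X_{(k)}'X_{(k)})$,

$SSE_1 = (y'y)_{(1)} - (X'y)_{(1)}'(X'X)_{(1)}^{-1}(X'y)_{(1)}$.

Model 0:

$n = \sum_{k=1}^{m} n_k$, $\bar{x} = \sum_{k=1}^{m} n_k \bar{x}_{(k)}/n$, $\bar{y} = \sum_{k=1}^{m} n_k \bar{y}_{(k)}/n$,

$(y'y)_{(0)} = (y'y)_{(1)} + \sum_{k=1}^{m} n_k (\bar{y}_{(k)} - \bar{y})^2$,

$(X'y)_{(0)} = (X'y)_{(1)} + \sum_{k=1}^{m} n_k (\bar{x}_{(k)} - \bar{x})(\bar{y}_{(k)} - \bar{y})$,

$(X'X)_{(0)} = (X'X)_{(1)} + \sum_{k=1}^{m} n_k (\bar{x}_{(k)} - \bar{x})(\bar{x}_{(k)} - \bar{x})'$,

$SSE_0 = (y'y)_{(0)} - (X'y)_{(0)}'(X'X)_{(0)}^{-1}(X'y)_{(0)}$.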

We  calculate  the  regression  solutions  using 
equation  (1)  with  the  summary  statistics 
computed  for  Model  0.  The  degrees  of  freedom 
associated  with  the  Error  Sums  of  Squares  are 
given by $df_0 = n - p - 1$, $df_1 = n - p - m$,
$df_2 = n - mp - m$.
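
The aggregation at the host can be illustrated for the simplest case. The sketch below assumes a single predictor (p = 1), so that every matrix in the formulas above reduces to a scalar, and assumes each node returns a tuple (n, xbar, ybar, sxx, sxy, syy) of centered statistics; the tuple layout and the name `pooled_sse` are illustrative assumptions, not the authors' code:

```python
def pooled_sse(summaries):
    """summaries: list of (n, xbar, ybar, sxx, sxy, syy), one tuple per node."""
    m = len(summaries)
    n = sum(s[0] for s in summaries)
    xbar = sum(s[0] * s[1] for s in summaries) / n
    ybar = sum(s[0] * s[2] for s in summaries) / n

    # Model 2: separate slope and intercept at every node.
    sse2 = sum(syy - sxy * sxy / sxx for _, _, _, sxx, sxy, syy in summaries)

    # Model 1: common slope, node-specific intercepts (pool within-node sums).
    sxx1 = sum(s[3] for s in summaries)
    sxy1 = sum(s[4] for s in summaries)
    syy1 = sum(s[5] for s in summaries)
    sse1 = syy1 - sxy1 * sxy1 / sxx1

    # Model 0: common slope and intercept (add the between-node terms).
    sxx0 = sxx1 + sum(nk * (xk - xbar) ** 2 for nk, xk, _, _, _, _ in summaries)
    sxy0 = sxy1 + sum(nk * (xk - xbar) * (yk - ybar)
                      for nk, xk, yk, _, _, _ in summaries)
    syy0 = syy1 + sum(nk * (yk - ybar) ** 2 for nk, _, yk, _, _, _ in summaries)
    sse0 = syy0 - sxy0 * sxy0 / sxx0

    p = 1
    dfs = (n - p - 1, n - p - m, n - m * p - m)
    slope = sxy0 / sxx0                 # least squares fit for all the data
    intercept = ybar - slope * xbar
    return (sse0, sse1, sse2), dfs, (intercept, slope)
```

With two nodes whose data lie on parallel lines with different intercepts, SSE_1 and SSE_2 are zero while SSE_0 is positive, which is the pattern the homogeneity tests are meant to detect.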

We  may  now  calculate  two  test 
statistics.  We  may  test  for  Total  Homogeneity 
(which  is  the  true  redundancy  test  for  fault 
checking)  and  for  Homogeneity  of  Slopes  Only 
(which  we  have  included  simply  because  it  is  so 
easy  to  do  and  might  provide  some  detail  about 
what  went  wrong  if  something  did).  The  test 
for  Total  Homogeneity  uses  the  statistics: 
$SS_{(2,0)} = SSE_0 - SSE_2$, $df_{(2,0)} = df_0 - df_2 = (m-1)(p+1)$,
$MS_{(2,0)} = SS_{(2,0)}/df_{(2,0)}$, and
$F_{(2,0)} = MS_{(2,0)}/MSE_2$, where $MSE_2 = SSE_2/df_2$.

The test for Homogeneity of Slopes Only uses
the statistics: $SS_{(2,1)} = SSE_1 - SSE_2$,
$df_{(2,1)} = df_1 - df_2 = (m-1)p$,
$MS_{(2,1)} = SS_{(2,1)}/df_{(2,1)}$, and
$F_{(2,1)} = MS_{(2,1)}/MSE_2$, where $MSE_2 = SSE_2/df_2$. If
heterogeneity  is  detected,  then  further  tests  may 
be  made  to  isolate  the  nodes  with  different  and 
presumed  faulty  results.  We  note  at  this  point 
that  we  set  the  significance  level  for  our 
homogeneity  test  very  small.  This  is  because  we 
want  a  very  small  false  alarm  rate.  We  only 
want  to  detect  egregious  deviations  from 
homogeneity,  as  might  be  caused  by  a  hardware 
or  software  failure. 
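
The two F ratios follow directly from the three error sums of squares and their degrees of freedom. A small sketch, with a hypothetical function name; comparing the ratios against a critical value of the F distribution would need a statistics library and is not shown:

```python
def homogeneity_tests(sse, df):
    """sse = (SSE0, SSE1, SSE2), df = (df0, df1, df2); returns the two F ratios."""
    sse0, sse1, sse2 = sse
    df0, df1, df2 = df
    mse2 = sse2 / df2                                  # error mean square, Model 2
    f_total = ((sse0 - sse2) / (df0 - df2)) / mse2     # total homogeneity
    f_slopes = ((sse1 - sse2) / (df1 - df2)) / mse2    # homogeneity of slopes only
    return f_total, f_slopes
```

For example, with m = 4 nodes, p = 1 and n = 18 the degrees of freedom are (16, 13, 10), and `homogeneity_tests((34.0, 16.0, 10.0), (16, 13, 10))` gives the ratios 4.0 and 2.0.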

The  methodology  described  above  is 
designed to isolate potentially catastrophic
failures in the node hardware or software.
However, it could also be used to effect a speed-up
of other kinds of homogeneity checks on
data.  For  example,  suppose  that  instead  of 
parsing  the  data  to  allocate  it  to  nodes  in  a 
manner  which  creates  random  samples,  we 
allocated  the  data  to  correspond  to  some 
meaningful  partition  of  the  data  such  as 
orthants  of  the  X  space.  The  parallel  algorithm 
described  above  would  then  yield  a  speed-up  of 
this  homogeneity  check,  but  would  no  longer 
have  any  fault  detection  capability.  However,  a 
simple  modification  whereby  we  split  each 
partition  into  two  or  more  subpartitions  via 
random  sampling  would  still  give  a  homogeneity 
check  using  straightforward  extensions  of  the 
above  methodology. 

3.  The  Timing  Results 

The  timing  study  is  designed  to  measure 
the  effectiveness  of  the  parallel  scheme  described 
above,  and,  in  particular,  to  measure  the  effect 
of  changing  the  size  of  the  communications 
packets  sent  between  the  host  and  the  nodes. 
We  began  by  generating  data  files  of  similar 
data.  We  did  this  by  taking  an  original  data  set 
with  six  independent  variables  and  then 
generating  data  sets  of  arbitrary  size.  We  then 
matched  the  covariance  structure  of  the  rows  of 
the  X  matrix  with  that  of  the  original  problem, 
made  the  regression  coefficients  the  same  as  in 
the  original  problem,  and  matched  the 
variability  of  the  generated  residuals  with  those 
of  the  original  problem.  Hence,  regardless  of  the 
size  of  the  test  data  set,  we  could  be  assured 
that  it  stochastically  agreed  with  the  original 
data  set.  In  this  sense,  our  test  data  sets  were 
comparable. The sizes selected for this part of
the study were n = 8000 and 16000
observations. We also used various numbers of
nodes,  so  that  the  speed-up  from  parallelizing 
could  be  determined. 

The  study  also  measured  the  effect  of 
differing  sizes  of  communication  packets  sent 
from  the  host  to  the  nodes.  The  sizes  used  in 
this  study  were  125,  250,  and  500  observations 
per  node.  Since  we  used  a  “broadcast” 
transmission  of  the  data  for  all  nodes,  with  each 
node  picking  its  data  out  of  the  message,  the 
size of packet transmitted also depends on the
number of nodes. The size of the actual
transmitted packet is the number of nodes times
the package size. It might appear that one should
automatically  choose  the  largest  possible  size  of 
packet  in  order  to  minimize  the  effect  of 
communications  start  up  overhead.  However, 
this  could  lead  to  nodes  remaining  idle  while 
transmission  is  taking  place.  Hence,  it  may  be 
more  effective  to  use  smaller  packets,  so  that 
the  nodes  may  continue  doing  productive  work. 
We  used  several  sizes  to  examine  the  effect  of 
package  size  on  overall  efficiency. 

We  measured  at  each  node  the  overall 
time  for  the  program,  the  time  waiting  for  the 
host  to  read  data,  the  computation  time,  and 
data  transmission  time  (both  sending  and 
receiving).  Our  measure  of  effective  time  for  the 
computations  is  the  maximum  over  nodes  of 
overall  time  minus  time  waiting  for  the  host  to 
read  the  data.  Hence,  the  computation  time 
and  the  communications  overhead  time  for  the 
hypercube  are  included  in  our  time  measure,  but 
the  time  for  the  host  to  initially  read  in  the  data 
is  not  included.  The  speed-up  for  any  given 
number  of  nodes  for  a  particular  configuration  is 
given  by  the  ratio  of  the  time  for  one  node 
divided  by  the  time  measure  for  that  number  of 
nodes.  Times  were  measured  by  an  internal 
clock  subroutine  on  the  hypercube  and  are  given 
in  milliseconds.  The  results  of  our  simulations 
are  given  in  Table  1.  Each  number  is  the 
average  for  two  runs. 

We  observe  from  Table  1  that  the 
effective  times  indeed  decrease  as  we  add  more 
nodes.  We  also  note  that  the  speed-up  is  not 
linear  in  the  number  of  nodes.  The  speed-ups 
achieved  for  the  six  rows  of  Table  1  (from  one 
node  to  sixteen)  are  respectively:  12.65,  13.55, 
11.79,  13.16,  10.43,  and  12.31.  The  speed-ups 
are  greater  for  the  larger  data  set  and  are 
greater  for  package  size  125  observations  per 
node  than  for  the  larger  packages.  The  reason 
that  speed-up  is  not  perfectly  linear  is  that 
communications  overhead  increases  as  the 
number  of  nodes  increases.  However,  as  might 
be  expected  for  perfectly  parallel  computations 
as  we  have  here,  the  computation  time  indeed 
decreases  as  the  reciprocal  of  the  number  of 
nodes. In fact, a regression of computation time
(again the maximum over nodes) versus sample
size divided by number of nodes yields an $R^2$ of
0.999978. The communication overhead
prevents the speed-up from being perfectly
linear.
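
The speed-up figures quoted above can be reproduced from the effective times. A short sketch using the Table 1 row for 125 observations per node per package and n = 8000; the efficiency measure (speed-up divided by the number of nodes) is added here for illustration and is not reported in the paper:

```python
# Effective times in ms from Table 1 (125 observations per node per package,
# sample size 8000), keyed by the number of nodes.
times = {1: 15337, 2: 7722, 4: 3931, 8: 2058, 16: 1212}

def speedup(times, nodes):
    """Ratio of the one-node time to the time on `nodes` nodes."""
    return times[1] / times[nodes]

def efficiency(times, nodes):
    """Speed-up divided by the number of nodes (1.0 would be perfectly linear)."""
    return speedup(times, nodes) / nodes

for m in sorted(times):
    print(m, round(speedup(times, m), 2), round(efficiency(times, m), 3))
```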

We  also  made  some  additional  runs  with 
larger  sample  sizes  to  explore  the  limiting 
behavior of the speed-up. We wished to observe
where the asymptote, if any, lay with respect to
increased speed-up and sample size. Hence, we
made  additional  runs  with  one  and  sixteen  nodes 
for  each  package  size  for  sample  sizes  32000, 
64000,  and  128000.  The  results  of  these  runs 
and  some  information  from  Table  1  are 
presented  in  Table  2. 

As may be seen from Table 2, the speed-up
appears to have an asymptotic value of
approximately 13.8 for sixteen nodes. This
seems  to  be  the  case  regardless  of  the  package 
size.  Nevertheless,  for  any  given  sample  size, 
the  smaller  package  size  gives  smaller  effective 
times  and  larger  speed-up.  Hence,  we  observe 
the  phenomenon  that  the  desired  efficiency  of 
the  larger  package  size  (namely,  the  lesser 
number  of  times  the  communication  startup 
overhead  is  involved)  is  overcome  by  the  fact 
that  the  nodes  sit  idle  waiting  for  data  to  arrive 
with  the  larger  package  sizes. 

We  make  one  final  remark  with  regard 
to  effective  time.  If  the  time  at  the  host  for 
reading  in  the  data  is  included,  the  time  to  read 
in  the  data  overwhelms  the  computations  for 
this  problem.  We  conceive  of  a  situation  in 
which  the  data  are  acquired  in  some  automated 
mode  which  can  bypass  the  reading  step  that  we 
did  here.  This  is  reasonable,  since  the  fault 
checking  feature  described  above  would  be 
critical in a situation where the data arrived in
huge amounts and were processed in an
automated fashion. The effective time as we
have  measured  it  gives  a  fair  reading  of  the 
speed-up  from  the  parallel  implementation  of 
regression.  All  communication  overhead  is 
included  except  the  reading  of  the  data.  This  is 
a  commonly  applied  method  for  measuring 
speed-up.

4. A Short Example Using Kernel Estimation
and Bootstrapping

We  note  that  the  increase  in  efficiency 
(speed-up)  is  small  for  regression  calculations 
when  input  time  is  included.  Therefore,  we  have 
also  used  the  Hypercube  to  generate  speedup  for 
a  more  computationally  intensive  statistical 
procedure,  namely  Kernel  Estimation  with 
Bootstrapping.  Density  estimation  is  motivated 
as  follows.  Suppose  that  a  set  of  observed  data 
is  assumed  to  be  a  sample  from  an  unknown 
probability  density  function.  We  wish  to 
construct  an  estimate  of  the  underlying  density 
from  the  observed  data.  Kernel  density 
estimation  is  a  nonparametric  technique  used  to 
accomplish  this  estimation. 

Suppose that the underlying density is
f(x). The kernel density estimate $\hat{f}(x)$ is defined
by:

$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)$,

where n is the sample size, h is the width of the
window of the kernel, the $X_i$ are the observed
data, and $K(\cdot)$ is the kernel function. $K(\cdot)$
satisfies the following conditions: $\int K(x)\,dx = 1$,
$\int x K(x)\,dx = 0$, $\int K^2(x)\,dx < \infty$, and
$\int |x|^p K(x)\,dx < \infty$ for some p. The limits of integration
are  selected  appropriately  for  the  particular 
kernel  function.  Use  of  a  kernel  density  estimate 
is  similar  to  use  of  a  weighted  average  to 
estimate  a  parameter. 
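
A minimal sketch of the kernel density estimate in Python. The Gaussian kernel is an assumption made for illustration (the text does not say which four kernels were used), and the function names are hypothetical:

```python
import math

def gaussian_kernel(u):
    """Standard normal density, one kernel that satisfies the conditions above."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, h, kernel=gaussian_kernel):
    """Kernel density estimate at the point x with window width h."""
    n = len(data)
    return sum(kernel((x - xi) / h) for xi in data) / (n * h)
```

Because the estimate is a plain sum over the observations, the sum can be split across nodes and the partial sums added afterwards; evaluating the estimate at many points x is likewise independent from point to point.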

We  performed  a  small  timing  study  of 
parallelizing  kernel  density  estimation.  We  used 
10000  data  points  and  used  four  kernel  functions 
to  obtain  four  different  kernel  density  estimates. 
The  total  processing  time  to  accomplish  this 
task  is  the  outcome  measure.  All  overhead  time 
(including  time  reading  the  data)  is  accounted 
for  in  the  measure. 

Again  we  use  the  maximum  of  the  node 
times  as  our  measure  of  the  node  processing 
time.  Time  is  in  milliseconds  and  each  time  is 
for one run. The results for 1, 2, 4, 8, and 16
nodes were respectively: 8,292,879; 4,143,741;
2,082,504; 1,055,863; and 545,479. The speed-up
from one to sixteen nodes was 15.20. The
parallelism achieves close to perfect linear speed-up
in the number of nodes. The communication
time  is  extremely  small  compared  to  the  large 
amount  of  computation  time  for  this 
application. 

Bootstrapping  is  another  nonparametric 
technique  which  has  many  applications.  It  is 
used  to  obtain  estimates  and  standard  errors  for 
those  estimates,  as  well  as  to  estimate  bias  in 
estimates.  Bootstrapping  involves  repeated 
resampling  from  the  original  sample  of  data. 
We  have  applied  bootstrapping  to  the  density 
estimation  problem.  Preliminary  studies  show 
that  parallelizing  the  resampling  portion  of  the 
bootstrap  will  yield  significant  improvements  in 
processing  time.  Detailed  results  of  our 
bootstrap  study  will  be  presented  in  a 
forthcoming  paper. 
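
The resampling step can be sketched as follows. This is a generic bootstrap standard error, not the authors' implementation; the function name, the default of 1000 resamples and the fixed seed are illustrative assumptions:

```python
import random
import statistics

def bootstrap_se(data, statistic, n_resamples=1000, seed=0):
    """Bootstrap standard error of `statistic` computed from `data`."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    n = len(data)
    replicates = []
    for _ in range(n_resamples):
        # Draw n observations with replacement from the original sample.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    return statistics.stdev(replicates)
```

Each replicate is computed independently of the others, so the loop over resamples can be divided among the nodes, with each node using its own random stream and returning its replicates to the host.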

5.  Conclusions 

We  have  shown  that  significant  gains  in 
efficiency  may  be  had  by  parallelizing  statistical 
computations.  We  have  also  presented  a 
method  for  achieving  fault  detection  in  multiple 
regression  parallelized  computations.  This 
method  would  be  of  some  importance  in 
maintaining  a  “stand-alone”  system  with 
automated  input  and  processing  which  used 
multiple  regression  calculations  in  accomplishing 
its  mission.  We  plan  to  extend  the  fault 
detection  to  other  computations,  such  as  kernel 
density estimation and bootstrapping.

Table 1

Results of Part 1 of the Timing Study
Effective Time (milliseconds)

Observations
per node per   Sample            Number of Nodes
package        Size         1       2      4      8     16

125             8000    15337    7722   3931   2058   1212
               16000    30703   15458   7854   4088   2266

250             8000    15337    7721   3935   2084   1301
               16000    30699   15448   7852   4106   2332

500             8000    15358    7733   3957   2155   1472
               16000    30737   15467   7877   4155   2496

Table 2

Results of Part 2 of the Timing Study
Effective Time (milliseconds)

Observations
per node per   Sample     Number of Nodes
package        Size          1       16   Speed-up

125             8000     15337     1212      12.65
               16000     30703     2266      13.55
               32000     61471     4460      13.78
               64000    122132     8855      13.79
              128000    243648    17667      13.79

250             8000     15337     1301      11.79
               16000     30699     2332      13.16
               32000     61445     4521      13.59
               64000    122199     8905      13.72
              128000    243714    17685      13.78

500             8000     15358     1472      10.43
               16000     30737     2496      12.31
               32000     61515     4661      13.20
               64000    122309     9043      13.53
              128000    244085    17815      13.70

Abstract

Certain  engineering  problems,  such  as  radar  cross 
section  modeling,  must  solve  systems  of  linear 
equations  AX  =  B,  where  A  is  a  large,  dense,  128-bit 
complex  matrix  and  B  is  the  matrix  of  right  hand 
sides. Solutions to problems as large as 20000x20000
are needed now, with even larger problems
anticipated.

This paper describes an implementation of an out-of-core
linear system solver on the Intel iPSC/860, a
hypercube with Intel i860 based compute nodes and a
compatible Intel Concurrent I/O system. Block
Gaussian elimination, with restricted pivoting, is
used, paging blocks on and off the I/O system as
needed. Gaussian elimination requires about 21.5 trillion
(64-bit real) floating point operations to solve the 20K problem,
which would take 2.5 days on a 100 Mflop machine.

At  both  the  cube  and  node  levels,  the  basic  operation, 
C  =  C  -  A*B,  has  been  carefully  optimized.  A  hand 
coded  i860  assembler  kernel  is  used  at  the  node  level 
and  asynchronous  message  passing  and 
asynchronous  disk  I/O  overlap  almost  all  data 
movement with computation. A 64 node iPSC/860
with 6 I/O nodes and 12 disks factors a 20K problem
in about four hours, sustaining 1.4 Gflops. Plans to
extend  the  algorithm  to  even  larger  problems  using 
tape  storage  will  be  described. 

The  iPSC/860  and  the  Concurrent  File  System 

The iPSC/860, a distributed-memory message-passing
multicomputer, contains up to 128 separate
compute nodes with optional I/O nodes and disks.
Each compute node is an Intel i860 microprocessor
with 8 Megabytes of memory and a FIFO-based
interface to the Direct-Connect routing hardware.
Each node injects and removes messages from the
communication system at a rate of 2.8 Mbytes/sec.

Since  the  peak  double  precision  performance  of  the 
i860  on  matrix  computations  is  40  Mflops,  a  64  node 
system  has  a  peak  performance  of  2.56  Gflops.  This 
assumes  that  all  message  passing  and  I/O  required 
by  the  algorithm  is  completely  overlapped  with 
computation.  The  actual  transfer  of  bytes  between 
the  FIFO  interface  and  node  memory  is  done  by  the 
i860  to  maintain  coherency  of  the  on-chip  cache. 
Thus,  all  message  traffic  involves  some  cycle  stealing 
and  peak  performance  will  be  unobtainable. 

The  optional  I/O  system  on  the  iPSC/860  system 
provides  parallel  access  to  a  set  of  SCSI  disk  drives 
controlled  by  Intel  80386  microprocessor-based  I/O 
nodes.  Messages  between  I/O  nodes  and  compute 
nodes  compete  for  the  same  wires  which  node-to-node 
messages  use.  Each  I/O  node  can  read  or  write  a  disk 
at about 1 Mbyte/sec. The Concurrent File System™
automatically  spreads  files  across  the  available 
disks.  All  compute  nodes  independently  open  files, 
seek  locations,  and  read  data  simultaneously.  Data 
is  transferred  to  and  from  the  I/O  system  in  separate 
4K  byte  packets. 

For a detailed description of the i860 see [1] and for a
detailed  description  of  the  iPSC/860  see  [2]. 

Block  Gaussian  Elimination 

The  Large  Out-Of-Core  Solver  (LOOCS™)  code 
implements  a  variant  of  block  Gaussian  elimination. 
The  matrix  is  divided  into  square  submatrices  called 
disk  sections  which  are  the  units  that  are  swapped 
off and on the disk. When A is a t x t matrix of disk
sections, the factor algorithm can be described as
follows:

for i = 1, t

   for j = 1, i-1            // do ith row
      for k = 1, j-1
         Aij = Aij - Aik * Akj
      endfor
      Aij = Aij * Ajj         // Ajj is already inverted
   endfor

   for j = 1, i-1            // do ith col
      for k = 1, j-1
         Aji = Aji - Ajk * Aki
      endfor
   endfor

   for j = 1, i-1            // do ith diagonal
      Aii = Aii - Aij * Aji
   endfor

   Aii = inverse of Aii
endfor

The  corresponding  solve  algorithm,  for  one  block 
column  of  B,  appears  as; 

for i = 1, t                 // forward elimination
   for j = 1, i-1
      Bi = Bi - Aij * Bj
   endfor
endfor

for i = t, 1, -1             // back substitution
   for j = t, i+1, -1
      Bi = Bi - Aij * Bj
   endfor
   Bi = Aii * Bi
endfor
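
The factor and solve loops can be checked on a small in-memory example. The sketch below is a plain Python illustration, not the LOOCS code: blocks are lists of lists held in memory rather than disk sections, the helper names are hypothetical, and the row update is read as multiplying by the already inverted jth diagonal block:

```python
def mat_mul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mat_sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mat_inv(a):
    """Gauss-Jordan inverse with partial pivoting inside the block."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        d = aug[c][c]
        aug[c] = [v / d for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def factor(A):
    """In-place block factorization; A is a t x t list of square blocks."""
    t = len(A)
    for i in range(t):
        for j in range(i):                        # ith block row
            for k in range(j):
                A[i][j] = mat_sub(A[i][j], mat_mul(A[i][k], A[k][j]))
            A[i][j] = mat_mul(A[i][j], A[j][j])   # A[j][j] already holds its inverse
        for j in range(i):                        # ith block column
            for k in range(j):
                A[j][i] = mat_sub(A[j][i], mat_mul(A[j][k], A[k][i]))
        for j in range(i):                        # ith diagonal block
            A[i][i] = mat_sub(A[i][i], mat_mul(A[i][j], A[j][i]))
        A[i][i] = mat_inv(A[i][i])

def solve(A, B):
    """Solve in place for one block column B (list of t blocks) after factor(A)."""
    t = len(A)
    for i in range(t):                            # forward elimination
        for j in range(i):
            B[i] = mat_sub(B[i], mat_mul(A[i][j], B[j]))
    for i in range(t - 1, -1, -1):                # back substitution
        for j in range(t - 1, i, -1):
            B[i] = mat_sub(B[i], mat_mul(A[i][j], B[j]))
        B[i] = mat_mul(A[i][i], B[i])
```

Calling `factor(A)` and then `solve(A, B)` on a 4x4 system stored as a 2x2 grid of 2x2 blocks overwrites `B` with the solution, which makes it easy to compare against a known answer.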

This  variant  makes  all  changes  to  a  particular  disk 
section  at  once.  This  helps  minimize  I/O  since  disk 
sections  are  written  only  once  and  means  that  the 
algorithm  has  natural  checkpoints.  Every  section 
written to the disk is finished, except during the
forward solve which can be handled by changing
output files.

Only three types of operations are needed:

1. C = C - A*B

2. C = A*B

3. explicitly invert C

Explicit  inversion  of  diagonal  blocks  is  not 
necessary,  but  a  matrix-matrix  multiply  can  be 
parallelized much more efficiently than repeated
forward  elimination  and  backsubstitution,  so  even 
though  the  inversion  is  expensive  in  flops,  it  pays  for 
itself  in  the  later  uses  of  the  diagonal  block. 

This  algorithm  is  potentially  unstable  since  pivoting 
is  done  only  inside  diagonal  disk  sections.  Unless  the 
matrix  is  diagonally  dominant,  a  diagonal  disk 
section  could  be  exactly  singular  causing  the 
algorithm  to  fail.  Ill  conditioned  diagonal  disk 
sections  are  an  indication  of  numerical  instability. 
Unfortunately,  the  only  guaranteed  solution  is  to 
pivot  down  the  whole  column,  which  is  much  too 
expensive  since  most  of  the  column  is  not  in  memory. 
However,  the  explicit  inversion  of  the  diagonal 
sections  makes  it  easy  to  compute  their  condition 
numbers.  This  provides  monitoring  of  the  stability  of 
the  algorithm. 

Square  disk  sections,  as  large  as  memory  will  allow, 
minimize  I/O  bandwidth  requirements,  since  the 
work  is  proportional  to  the  cube  of  the  disk  section 
size  but  the  I/O  is  only  proportional  to  the  square. 

Parallel  Matrix  Multiply 

Both C = C - A*B and C = A*B are implemented
simultaneously, by using the functionality of the
BLAS3 routine ZGEMM (see [3]), which computes

Equation (1):  C = alpha*A*B + beta*C,

for specially chosen values of alpha and beta. The
parallel  matrix  multiply  routine  is  a  folded  version  of 
a  systolic  algorithm  for  matrix-matrix  product  [4]. 
Assuming a k x k torus of processors, a subset of the
hypercube topology, each disk section is divided into
square node sections. At the beginning of a
matrix multiply operation, each node in the mesh
will have one specially selected node section of the A,
B, and C disk sections. Each node will implement
equation (1) on node sections while simultaneously
passing its A section left and its B section up. If node
sections are large enough, messages arrive before the
node computation is finished so that the next node
section multiply need not wait.

The disk section matrix multiply loop consists of k
phases in which each node does the following:

Post receive from down
Post receive from right
Post send to left
Post send to up
Multiply
Wait for completion of messages

In the last (kth) phase no data is sent since there will
be no further arithmetic.

To obtain the correct answer, it is important that the
node sections of the A and B disk sections be
carefully allocated to compute nodes. Node sections
in the first row of A are assigned to their natural
processors in the mesh. The next row is circularly
shifted one position left. Each succeeding row shifts
further left. Similarly, the first column of B is
naturally assigned to the first column of processors.
The second column shifts up one position. Each
succeeding column of B shifts further up. Assuming
a 4x4 processor mesh, Figure 1 shows the assignment
to the mesh of the node sections of an A, B, and C
disk section.

Parallel Matrix Inversion

Parallel matrix inversion is implemented using
Gauss-Jordan inversion described in [5]. The
computational step between communications is a
rank one update of the active submatrix in the form
of equation (1), but with A and B as a single column
and a single row respectively.

i860 Optimization

To obtain optimal performance from the i860, a
ZGEMM routine was hand coded in assembly
language. The multiplication of the matrix A by a
column of B is implemented as a sequence of complex
ZAXPY operations, with the result accumulated in
the data cache and written to the matrix C at the
end. The code takes advantage of pipelined
arithmetic and data reads, dual instruction mode,
plus 128 bit reads and writes between registers and
the data cache. The asymptotic speed of the kernel is
37.5 Mflops per node with an n-half of 10. For a more
detailed description of this kernel see [6].

I/O Optimization

The I/O system must contain enough disks to store
the matrix and I/O nodes to provide adequate
transfer rate on and off those disks. When 64 i860
nodes solve a problem with 228x228 node sections, 6
I/O nodes are needed to provide adequate bandwidth.
Ten disks would be enough to hold a 20K matrix, but
since most systems have the same number of disks
per I/O node, twelve were used in the benchmark
system.

C11 = C11 - A11*B11   C12 = C12 - A12*B22   C13 = C13 - A13*B33   C14 = C14 - A14*B44

C21 = C21 - A22*B21   C22 = C22 - A23*B32   C23 = C23 - A24*B43   C24 = C24 - A21*B14

C31 = C31 - A33*B31   C32 = C32 - A34*B42   C33 = C33 - A31*B13   C34 = C34 - A32*B24

C41 = C41 - A44*B41   C42 = C42 - A41*B12   C43 = C43 - A42*B23   C44 = C44 - A43*B34

(Arrows in the original figure show A sections moving left and B sections moving up.)

Figure 1. Initial Node Section Assignment For Matrix Multiply
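
The skewed assignment of Figure 1 and the left and up shifts can be simulated serially. In the sketch below each node section is reduced to a single number and the torus is a pair of nested lists; this is an illustration of the data movement under those assumptions, not the message passing code, and `torus_multiply` is a hypothetical name:

```python
def torus_multiply(A, B, C):
    """Simulate C = C - A*B on a k x k torus; each node holds one scalar section."""
    k = len(A)
    # Initial skew: row i of A is shifted left by i, column j of B up by j,
    # so node (i, j) starts with A[i][(i+j) % k] and B[(i+j) % k][j].
    a = [[A[i][(i + j) % k] for j in range(k)] for i in range(k)]
    b = [[B[(i + j) % k][j] for j in range(k)] for i in range(k)]
    c = [row[:] for row in C]
    for phase in range(k):
        for i in range(k):
            for j in range(k):
                c[i][j] -= a[i][j] * b[i][j]      # local update at node (i, j)
        if phase < k - 1:                         # nothing is sent in the last phase
            # Every node receives A from its right neighbour and B from below.
            a = [[a[i][(j + 1) % k] for j in range(k)] for i in range(k)]
            b = [[b[(i + 1) % k][j] for j in range(k)] for i in range(k)]
    return c
```

With 0-based indices, node (1, 3) starts with A[1][0] and B[0][3], which corresponds to the entry C24 = C24 - A21*B14 in Figure 1.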


The I/O performance was optimized in three ways.
First,  the  required  I/O  was  pipelined  as  much  as 
possible,  so  that  blocks  needed  during  the  next 
multiply  were  fetched  during  the  previous  one. 
Furthermore,  writes  of  completed  blocks  were 
deferred  until  the  last  multiply  for  the  next  block 
when  no  fetch  was  performed.  This  allowed 
computation  of  the  next  block  to  begin  without 
waiting  for  the  previous  block  to  be  written. 

Second,  I/O  nodes  cache  certain  disk  blocks  in 
memory.  When  a  compute  node  is  reading  a  disk  file, 
the  I/O  nodes  try  to  preread  disk  blocks  so  that  when 
a  read  request  actually  arrives,  the  desired  block  is 
already  in  memory.  If  too  many  nodes  are  reading  at 
once,  the  memory  of  the  I/O  nodes  is  insufficient  to 
keep  the  readahead  blocks  in  memory  until  the 
actual read arrives. Such cache thrashing, which
seriously degrades the I/O performance, was
eliminated by increasing the size of the cache and
limiting the number of nodes reading at any one time
to 16. Similarly, writes were restricted so that only 8
nodes  write  at  a  time.  This  was  accomplished  by 
staging  the  reads  and  writes  during  different  phases 
of  the  matrix  multiply  algorithm. 

Finally,  it  was  important  to  carefully  locate  the  I/O 
nodes  and  select  which  sets  of  compute  nodes  are 
reading  or  writing  at  the  same  time.  The  iPSC/860 
uses  fixed  routing  to  send  messages  to  avoid 
deadlock.  Dimensions  in  a  64  node  cube  are 
numbered  from  0  to  5.  Messages  are  routed  in  the 
lowest  needed  dimension  first,  working  up  to  the 
highest  needed  dimension.  Optimum  I/O 
performance  is  hampered  by  contention  for  wires. 
Wire  contention  for  messages  headed  to  the  same 
compute  node  is  of  no  import  since  the  messages  are 
serialized  at  the  node  anyway.  What  must  be  avoided 
is  wire  contention  for  large  messages  which  are 
headed  for  different  nodes.  Reads  and  writes  are 
asymmetric.  When  reading,  the  large  messages  go 
from  I/O  nodes  to  compute  nodes,  so  these  paths  need 
to  avoid  contention.  When  writing,  the  large 
messages  go  from  compute  nodes  to  I/O  nodes  so 
these  paths  are  the  critical  ones. 


The  sixty  four  compute  nodes  can  be  arranged  in  an 
8x8  mesh  in  which  nodes  in  columns  differ  only  in 
their  three  lowest  bits  and  nodes  in  rows  differ  only 
in their three highest bits; node 8c + r lies in row r
and column c, both numbered from 0.

I/O nodes should be anchored to compute nodes in the
same  row,  for  example,  the  top  row.  Routing  of 
messages  from  I/O  nodes  to  compute  nodes  follows 
the  hypercube  routing  down  the  column  and  then 
follows  the  hypercube  routing  across  the  row. 
Therefore  two  columns  of  nodes  can  read  without  any 
contention  among  paths  from  different  I/O  nodes  to 
different  compute  nodes.  The  paired  columns  are 
(0,56),  (8,48),  (16,40)  and  (24,32). 
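
The fixed routing rule (lowest needed dimension first) and the 8x8 arrangement can be sketched as follows. The function names are hypothetical, and the sketch only traces the path of a single message; it does not model contention between messages:

```python
def ecube_route(src, dst, dims=6):
    """Nodes visited when differing address bits are corrected lowest dimension first."""
    path = [src]
    node = src
    for d in range(dims):
        bit = 1 << d
        if (node ^ dst) & bit:     # this dimension is needed
            node ^= bit            # cross the wire in dimension d
            path.append(node)
    return path

def mesh_position(node):
    """(row, column) in the 8 x 8 arrangement: the low three bits select the row."""
    return node & 7, node >> 3
```

For example, `ecube_route(0, 13)` visits nodes 0, 1, 5 and 13: the message first moves down column 0 (nodes 1 and 5) and then across row 5 to column 1, which is the column-then-row behaviour described above.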

When  writing,  the  paths  from  the  compute  nodes  to 
the  I/O  nodes  are  of  interest.  These  paths  go  up 
columns  followed  by  routing  in  the  row  of  anchor 
nodes.  Nothing  can  be  done  about  contention  in  the 
row  of  anchor  nodes,  but  one  row  of  nodes  should 
write  at  a  time. 

Performance 


Performance  of  the  parallel  matrix  multiply  routine 
is  summarized  in  Table  1.  There  is  no 
communication  on  a  single  node.  That  column 
measures  the  performance  of  the  assembly  language 
routine.  Degradation  of  performance  for  small 
problems  on  large  machines  is  characteristic  of 
message  passing  machines.  The  performance  of  64 
nodes for n = 2048 is 35.9 Mflops/node which shows
that little is lost to message passing overhead.

Table 1. Matrix Product Performance (Mflops)

                       Nodes
Dimension       1       4      16      64

     8       17.1       -
    16       28.4      15      11       -
    32       34.1      35      53      42
    64       36.5     106     113     141
   128       37.4     130     207     365
   256       37.8     141     423     763
   512          -     147     548    1379
  1024          -       -     578    2165
  2048          -       -       -    2300

The  entire  LOOCS  code  was  timed  on  a  64  node 
iPSC/860  with  6  I/O  nodes  and  12  disks.  Table  2 
shows  times  and  Megaflops  obtained  on  four 
problems. 

Table 2. Factorization Performance

Dimension    Seconds    Mflops

     2500        146       285
     5000        565       590
    10000       2700       987
    20000      15300      1394
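
The Mflops column of Table 2 is consistent with counting 8n^3/3 real floating point operations for a complex factorization of order n. That operation count is an assumption made here (the paper quotes about 21.5 trillion operations for the 20K problem, slightly more than this formula gives), and `factor_mflops` is a hypothetical helper:

```python
def factor_mflops(n, seconds):
    """Sustained Mflops, assuming 8*n**3/3 real flops for a complex factorization."""
    return (8.0 * n ** 3 / 3.0) / seconds / 1.0e6

for n, secs in [(2500, 146), (5000, 565), (10000, 2700), (20000, 15300)]:
    print(n, secs, round(factor_mflops(n, secs)))
```

The computed rates agree with the table to within about one Mflop, the small differences being attributable to the rounded times.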

Solve  performance  depends  on  the  number  of  right 
hand  sides.  One  right  hand  side  is  completely  I/O 
bound.  A  full  disk  section  of  right  hand  sides  can  be 
solved  at  the  same  speed  as  the  factorization. 

Extensions  to  Tertiary  Storage 

Users  wish  to  solve  problems  of  order  100,000.  The 
current  algorithm  requires  160  Gbytes  of  disk  which 
is  not  cost  effective.  It  is  possible  to  extend  the  same 
hierarchical  decomposition  one  more  level  to  use  a 
tertiary  storage  medium  such  as  tape.  Disk  sections 
are  aggregated  into  large  square  tape  sections.  The 
block  algorithm  is  now  applied  to  tape  sections.  The 
three  types  of  operations  among  tape  sections  are 
implemented  as  a  sequence  of  operations  on  disk 
sections. A total of 6 tape sections must fit on disk:
the current A, B, and C sections; one old C being written
to tape; and two new A and B sections being fetched
from tape. The tape sections should be made as large
as possible subject to the constraint that six tape
sections fit on the available disk space. Given the
transfer rate of 8mm video tapes, 12 I/O nodes with
twelve disks and 6 I/O nodes with 12 8mm video tape
drives provide sufficient I/O bandwidth from tape to
disk to cube to disk to tape to keep a 128 node
iPSC/860 compute bound. Such a machine could
factor and solve a double precision complex system of
linear equations of size 100,000 in about 8 days.
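
The storage and time figures in this section can be checked with a few lines of arithmetic. The sketch assumes 16 bytes per 128-bit complex element and, as an assumption not stated in the text, 8n^3/3 real floating point operations for the factorization; the function names are hypothetical:

```python
def matrix_gbytes(n, bytes_per_element=16):
    """Storage for a dense n x n matrix of 128-bit complex numbers, in 10**9 bytes."""
    return n * n * bytes_per_element / 1.0e9

def required_gflops(n, days):
    """Sustained rate needed to finish 8*n**3/3 real flops in the given time."""
    return (8.0 * n ** 3 / 3.0) / (days * 86400.0) / 1.0e9

print(matrix_gbytes(20000))
print(matrix_gbytes(100000))
print(required_gflops(100000, 8))
```

Under these assumptions the order 100,000 matrix needs 160 Gbytes, matching the figure above, and finishing in 8 days requires a little under 4 Gflops sustained, or roughly 30 Mflops on each of 128 nodes.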

Conclusions 

The  LOOCS  code  running  on  the  iPSC/860  obtains 
more  than  half  the  theoretical  peak  performance  of 
the  machine.  It  provides  a  fast  and  cost  effective 
platform  for  solving  large  dense  systems  of  linear 
equations.

Abstract.

The parallelization problem can be divided into
three main stages: identification of parallelism, which
includes dependency analysis; partitioning the statements
into atomic tasks of granularity suitable to the
target architecture; and scheduling these tasks onto the
processors.

An MIMD coarse grained parallel algorithm is developed
for the triangular Sylvester equation. We
compare well known scheduling heuristics such as the
naive and compute-ahead with the N-cp/misf methods
which are described here. These methods trade off
time and space according to the value of the parameter
N. Our conclusion is that the N-cp/misf methods
are faster than compute-ahead.

1  Introduction 

1.1  Stages  of  parallelization 

As  mentioned  in  the  abstract,  the  parallelization  prob¬ 
lem  consists  of  three  important  stages: 

•  Identifying  parallelism  and  finding  the  data  de¬ 
pendencies. 

•  Partitioning the algorithm into indivisible tasks
and the data into corresponding data items.
The size of the tasks depends on the problem as
well as the target architecture.

•  Scheduling  the  execution  of  these  tasks  and  map¬ 
ping  the  data  items  into  a  given  multiprocessor. 

Supported by Grant No. 8706122 from NSF.


In  the  first  stage,  the  programmer  or  the  com¬ 
piler  identifies  parallelism  at  the  finest  possible  grain. 
This  parallelism  is  then  represented  by  a  data  de¬ 
pendency  graph.  In  the  second  stage,  the  data  de¬ 
pendency  graph  is  partitioned  into  tasks  appropriate 
for  the  given  granularity  level  of  the  target  architec¬ 
ture.  Under  the  convexity  constraint,  Sarkar  [14],  this 
results  in  a  directed  acyclic  data  dependency  graph 
(DAG)  in  which  all  redundant  edges  have  been  deleted. 
For  message  passing  architectures,  the  data  must  be 
distributed  amongst  the  local  memory  of  the  proces¬ 
sors.  In  this  case  the  data  must  also  be  partitioned 
so  that  they  are  compatible  with  the  task  partitioning 
and  the  architecture.  In  the  final  stage,  the  data  items 
are  mapped  and  the  execution  of  the  partitioned  graph 
is  scheduled  in  the  given  multiprocessor.  In  this  pa¬ 
per,  we  consider  the  problem  of  static  list  scheduling 
and  data  mapping  for  MIMD  architectures  such  as  hy¬ 
percubes.  For  a  more  detailed  description  we  refer  the 
reader  to  Gerasoulis  and  Nelken  [6]  and  Nelken  [12]. 

1.2  The  problem 

Consider  the  matrix  equation 

AX+XB=C 

where A, B and C are known $m \times m$, $n \times n$ and
$m \times n$ real matrices respectively. The unknown X is
also $m \times n$.

This equation is solvable if and only if A and $-B$
have no eigenvalues in common, Golub et al. [7]. Henceforth,
we will assume that the given matrix equation is
solvable.  A  transformational  solution  method  is  based 
upon  the  equivalence  of  the  original  problem  with 

$$(U^{-1}AU)(U^{-1}XV) + (U^{-1}XV)(V^{-1}BV) = U^{-1}CV.$$

The transformational solution method consists of
four stages, Golub et al. [7].

1.  Transform A and B into a “simple” form by
$A_1 = U^{-1}AU$ and $B_1 = V^{-1}BV$.

2.  Compute $F = U^{-1}CV$.

3.  Solve the transformed system $A_1 Y + Y B_1 = F$.

4.  Compute $X = UYV^{-1}$, the solution to the original
system.
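
The equivalence behind these four stages can be checked in a few lines. The derivation below is a sketch added for clarity; it assumes only that U and V are nonsingular, and it writes Y for the transformed unknown as in stage 3.

```latex
\begin{align*}
AX + XB = C
 &\iff U^{-1}(AX + XB)V = U^{-1}CV \\
 &\iff (U^{-1}AU)(U^{-1}XV) + (U^{-1}XV)(V^{-1}BV) = U^{-1}CV \\
 &\iff A_1 Y + Y B_1 = F, \qquad Y = U^{-1}XV, \quad X = UYV^{-1}.
\end{align*}
```

The second line inserts $UU^{-1}$ between A and X and $VV^{-1}$ between X and B, which is why the same U and V must be used in all four stages.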

In particular, the Bartels-Stewart algorithm, see [1],
uses the transformations $A_1 = U^T A U$ and $B_1 = V^T B V$
where U and V are orthogonal matrices which are chosen
so that $A_1$ and $B_1$ are upper quasi-triangular. A
quasi-triangular matrix is triangular with possible $2 \times 2$
blocks along the diagonal.

In  this  paper,  we  will  be  concerned  only  with  the 
third  step  of  this  procedure,  solving  the  transformed 
system. We will assume that $A_1$ and $B_1$ are proper
upper triangular matrices. Thus we are faced with the
solution of a triangular Sylvester equation which is of
the form $AX + XB = C$, where A and B are upper
triangular matrices. For simplicity of presentation and
analysis we also assume that $m = n$.

1.3  Parallel  time 

Our aim, of course, is to reduce the parallel time, $T_p$,
which is defined as the elapsed time of the processor
which finishes last under the assumption that all
processors begin at the same time.

During execution, the processor which finishes last
is either idle or working. The idle time is composed
of:

•  $T_i$ - Idle time due to synchronization of the data
dependencies

•  $T_c$ - Idle time due to communication of data

•  $T_d$ - Idle time due to architectural constraints,
e.g. bottlenecks and hot spots.

The working time is composed of:

•  $T_a$ - The arithmetic time

•  $T_o$ - The parallel program overhead.

Thus, the parallel time is given by the following sum

$$T_p = T_a + T_i + T_c + T_d + T_o.$$

In the shared memory case, $T_c$ is replaced by $T_l$,
the memory latency time.

2  Identification  of  parallelism 

The  elements  of  X  can  be  computed  by  elementwise 
identification: 

$$x_{ij} = \frac{c_{ij} - \sum_{k=i+1}^{n} a_{ik} x_{kj} - \sum_{k=1}^{j-1} x_{ik} b_{kj}}{a_{ii} + b_{jj}}.$$

If the equation is solvable, $a_{ii} + b_{jj} \neq 0$ and the division
above can be performed. The $x_{ij}$'s must be found in a
certain order as depicted in the structure in figure 1.
$x_{n,1}$ is found first and is labeled by a 1. Then $x_{n,2}$ and
$x_{n-1,1}$ can be computed in parallel; they are labeled
by a 2. Afterwards, $x_{n-2,1}$, $x_{n-1,2}$ and $x_{n,3}$ can be
found in parallel and they are labeled by a 3 and so
on. Notice that all elements which are on the same
diagonal can be computed in parallel.

4  5  6  7
3  4  5  6
2  3  4  5
1  2  3  4

Figure  1:  A  structure  which  shows  the  order  in  which 
the  elements  of  X  are  solved. 

An algorithm with fewer arithmetic operations results
if we update the matrix C after computing each
element of X. The resulting algorithm, called AXXBC,
is:

1.  Compute a matrix element $x_{ij} = c_{ij} / (a_{ii} + b_{jj})$.

2.  Update elements in the j’th column of C according
to elements in the i’th column of A

$$c_{kj} = c_{kj} - a_{ki} x_{ij}, \quad 1 \le k \le i-1.$$

3.  Update elements in the i’th row of C according
to elements in the j’th row of B

$$c_{ik} = c_{ik} - b_{jk} x_{ij}, \quad j+1 \le k \le n.$$

Let us define the following operations:

$$d(i,j): \quad x_{ij} = \frac{c_{ij}}{a_{ii} + b_{jj}},$$

$$u1(k,j): \quad c_{kj} = c_{kj} - a_{ki} x_{ij}$$

and

$$u2(i,k): \quad c_{ik} = c_{ik} - b_{jk} x_{ij}.$$

The fine grain data dependency graph is given in figure 2
for the case n = 3. Because of our indexing scheme,
the symbol u1(k,j) may appear several
times, each time for a different value of the index
i. Similarly, u2(i,k) may appear several times. The
lines indicate data dependencies from top to bottom.
For example, d(3,1) must be finished before any of
u1(1,1), u1(2,1), u2(3,2) or u2(3,3) may begin execution.
On the other hand, these four statements can be
executed concurrently.
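
A sequential version of AXXBC can be sketched as follows. This is an illustration, not the authors' code: the function names `axxbc` and `residual` are hypothetical, indices are 0-based, and the matrices are plain lists of lists. The outer loop walks the diagonals in the order of figure 1; the three inner statements correspond to d(i,j), u1(k,j) and u2(i,k).

```python
def axxbc(A, B, C):
    """Solve A X + X B = C for upper triangular n x n A and B.

    X is built in a copy of C, diagonal by diagonal (labels of figure 1).
    """
    n = len(A)
    X = [row[:] for row in C]                 # X overwrites (a copy of) C
    for label in range(1, 2 * n):             # diagonals 1 .. 2n-1
        # 0-based (i, j) lies on diagonal `label` when label == n - i + j
        for j in range(max(0, label - n), min(label, n)):
            i = n - label + j
            X[i][j] /= A[i][i] + B[j][j]      # d(i, j)
            for k in range(i):                # u1(k, j): rows above i in column j
                X[k][j] -= A[k][i] * X[i][j]
            for k in range(j + 1, n):         # u2(i, k): columns right of j in row i
                X[i][k] -= B[j][k] * X[i][j]
    return X


def residual(A, B, X, C):
    """Largest entry of |A X + X B - C|, used only to check the result."""
    n = len(A)
    worst = 0.0
    for i in range(n):
        for j in range(n):
            s = sum(A[i][k] * X[k][j] + X[i][k] * B[k][j] for k in range(n))
            worst = max(worst, abs(s - C[i][j]))
    return worst
```

Elements on the same diagonal never update one another, so the iterations of the inner `for j` loop are independent; this is the parallelism that the labels of figure 1 expose.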




Figure  2:  Fine  grain  data  dependency  graph,  n  =  3. 

3  Partitioning 

3.1  General 

In  this  stage,  we  partition  the  fine  grain  operations 
into  atomic  or  indivisible  tasks  whose  granularity  is 
suitable to the target architecture. Many partitionings
are possible for the graph in figure 2. Sarkar [14] has
imposed the “convexity constraint” on partitionings.

A  convex  partitioning  is  one  that  satisfies  the  following 
conditions: 

•  A  task  can  begin  operating  when  all  its  inputs 
are  available.  It  operates  until  completion  and 
may  produce  outputs. 

•  Once  a  task  is  started  it  operates  until  comple¬ 
tion  without  interruption. 

The  motivation  is  that  non  convex  partitionings  may 
lead  to  arbitrarily  large  communication  and  synchro¬ 
nization  costs.  Convex  partitionings,  on  the  other 
hand,  have  an  acyclic  coarse  grain  dependency  graph 
that  can  be  used  on  the  macro-dataflow  model. 

3.2  According  to  rows 

Many partitionings of the fine grain data dependency
graph in figure 2 are possible. For example, algorithm
SYLV_DIAG of Kagstrom et al. [10] uses a row oriented
partitioning which is depicted in figure 3. Task
(k,k) finds the k’th row of X and task (k,j) uses the
k’th row of X to modify the j’th row of C for j < k.
In the figure, all statements which belong to a task
are circumscribed and the task’s name is written next
to them. It is obvious that this partitioning is not
suitable for the macro-dataflow model. For example,
task (3,2) may begin as soon as d(3,1) has completed.
However, under the rules of macro-dataflow, as described
by Sarkar [14], it will have to wait until (3,3)
has completed. Indeed, the SYLV_DIAG algorithm
of [10] sends $x_{3,1}$ as soon as it has been computed
and starts the execution of (3,2).

The SYLV_DIST_B and SYLV_DIST_WB algorithms
of [10] also use a similar row partitioning. However,
in both these algorithms columns of X are mapped
into the processors using the block and wrap mappings
respectively. This means that the tasks of figure 3
are further divided. Each statement of an original
task is to be executed in the processor which stores
that element of X. The difference between the two
partitionings stems from the different mappings. In
SYLV_DIST_B, block mapping is used and n/p contiguous
elements of each row of X are solved for by each


Figure 3: Row oriented partitioning which is used by
SYLV_DIAG and a finer partitioning used by the other
algorithms for n = 3 and p = 3 (data dependency lines
have been removed for clarity).

processor. SYLV_DIST_WB, on the other hand, uses
wrap mapping and sends each element of X as soon as
it has been computed. Therefore, this algorithm uses
messages of length 1. For a description of block and
wrap mapping see Ortega [13].

In our example, n = 3 and p = 3, both block and
wrap mappings are identical and so are the partitionings
of SYLV_DIST_B and SYLV_DIST_WB. These
are depicted in figure 3 by the dashed lines which divide
each original task of SYLV_DIAG.

The SYLV_BLOCK algorithm divides the original
matrix X into blocks and then solves for each block
using another algorithm such as SYLV_DIAG. For our
example, the blocks are of size $1 \times 1$ and the partitioning
obtained is the same as that of SYLV_DIST_B and
SYLV_DIST_WB.

3.3  According  to  diagonals 

We consider the following diagonal partitioning which
groups together all operations performed on the same
diagonal. In figure 4 we show the partitioned fine grain
data dependency graph. All tasks in a box marked
(k,j) belong to task $T_k^j$. For example, the task $T_1^1$
contains the statement d(3,1) while the task $T_2^3$ contains
the statements u1(1,1), u2(2,2), u1(2,2) and u2(3,3).
Note the definition of (3,3). We have lost the potential
parallelism between d(1,1), d(2,2) and d(3,3).




Figure 4: Partitioned data dependency graph. All
tasks in a box marked (k,j) belong to task $T_k^j$.


The number of diagonals in a full $n \times n$ matrix
is $2n - 1$. However, A and B are upper-triangular
matrices and only have n diagonals numbered
$n, n+1, \ldots, 2n-1$. The number of elements in diagonal k is
$n - |n - k|$. Task $T_k^k$ in figure 4 consists of finding the
k’th diagonal of matrix X, which can overwrite the
k’th diagonal of matrix C; it uses the n’th diagonal
of matrices A and B. Task $T_k^j$ uses diagonal k of X
(which is stored and accessed as diagonal k of C) to
modify diagonal j of C. It uses the $(n + j - k)$’th
diagonals of both A and B.

Given the above partitioning, we find $\tau_k^k$ and $\tau_k^j$, the
costs of executing tasks $T_k^k$ and $T_k^j$ respectively. These
are measured in terms of the old Flops as defined by
Golub and Van Loan [8]. The index arithmetic and
housekeeping operations are not counted since we are
only concerned with floating point operations.

$$\tau_k^k = n - |n - k|$$

$$\tau_k^j = \begin{cases} 2k & k < j \le n \\ 2(n - (j - k)) & k \le n < j \\ 4n - 2j & n < k < j. \end{cases}$$

After  rearranging  the  nodes  of  the  partitioned  data 
dependency  graph  shown  in  figure  4  we  obtain  the 
DAG  which  is  shown  in  figure  5  for  n  =  4.  In  this 
figure, each task is represented by a circle. Inside the
circle is its task id. To the right of the circle is the
weight of the task, $\tau_k^k$ or $\tau_k^j$. To the left of the circle
is the level of the task which is defined in section 4.2.

A  geometric  comparison  of  this  DAG  with  that  of 
GE,  see  Gerasoulis  and  Nelken  [6],  reveals  that  the 
GE  graph  is  wider  in  the  beginning  and  then  loses 
width  one  task  at  a  time  until,  towards  the  bottom, 
both  graphs  become  similar.  Thus  from  a  geometrical 
point  of  view,  the  partitioned  GE  has  more  potential 
parallelism  than  our  DAG. 

We can now compute a lower bound on the parallel
time of any scheduling which uses this partitioning.
Any scheduling must require at least the length of the
longest path L(s). Also, using p processors, we cannot
expect to execute faster than $T_1/p$ where $T_1$ is the
sequential time of the algorithm. Thus an obvious
lower bound is:

$$T_{bound} = \max\{L(s),\; T_1/p\} = \max\{3n^2 - 2n,\; n^3/p\}.$$
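
The task costs and the lower bound can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original analysis: it assumes that diagonal k updates only diagonals j with j - k <= n - 1, and that the DAG has edges from the task solving diagonal k to each of its update tasks and from each update task to the task solving the target diagonal. All function names are hypothetical.

```python
def tau(k, j, n):
    """Cost in old flops of the task in which diagonal k updates diagonal j.

    tau(k, k, n) is the cost of solving for diagonal k itself.
    """
    if k == j:
        return n - abs(n - k)
    if j <= n:                    # k < j <= n
        return 2 * k
    if k <= n:                    # k <= n < j
        return 2 * (n - (j - k))
    return 4 * n - 2 * j          # n < k < j


def targets(k, n):
    """Diagonals j > k updated by diagonal k (assumed: j - k <= n - 1)."""
    return range(k + 1, min(2 * n - 1, k + n - 1) + 1)


def sequential_time(n):
    """Sum of all task costs, i.e. the sequential time T_1."""
    return sum(tau(k, k, n) + sum(tau(k, j, n) for j in targets(k, n))
               for k in range(1, 2 * n))


def longest_path(n):
    """Longest weighted path L(s) through the assumed DAG."""
    finish = {}                   # finish[j]: longest path ending at solve task j
    for j in range(1, 2 * n):
        preds = [finish[k] + tau(k, j, n)
                 for k in range(1, j) if j in targets(k, n)]
        finish[j] = tau(j, j, n) + max(preds, default=0)
    return finish[2 * n - 1]


def lower_bound(n, p):
    """max{L(s), T_1 / p} for p processors."""
    return max(longest_path(n), sequential_time(n) / p)
```

Under these assumptions `longest_path(n)` reproduces $3n^2 - 2n$ and `sequential_time(n)` equals $n^3$, so for n = 4 and p = 2 the bound is determined by the longest path.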

3.4  Summary 

The  diagonal  oriented  partitioning  conforms  to  the 
macro-dataflow  model.  However,  the  row  oriented 
partitioning  is  used  in  a  way  which  violates  it.  Sarkar  [14] 
asks  whether  or  not  it  is  better  to  adhere  to  the  macro¬ 
dataflow  model.  He  mentions  that  specific  experi¬ 
ments  will  have  to  be  conducted  to  answer  this  ques¬ 
tion.  The  results  of  this  paper  can  be  seen  as  a  step 
in  this  approach. 

4  Scheduling 

4.1  General 

Under the assumption of zero communication cost we
apply the CP/MISF scheduling of Kasahara and
Narita [11] to the partitioned DAG of figure 5. Then


Figure  5:  The  DAG  for  n  =  4. 


we  assume  non-zero  communication  costs  and  briefly 
describe  a  four  step  scheduling  methodology. 




4.2  CP/MISF 

Description 

Kasahara and Narita’s [11] CP/MISF scheduling
is one of the best scheduling heuristics for a general
DAG when communication costs are assumed to be
zero. This heuristic has three stages:

1.  Determine  the  level  for  each  node.  The  level  of 
a  node  is  the  longest  path  length  from  the  node  to  the 
terminal  node  and  a  path  length  is  the  sum  of  all  the 
task  weights  in  the  path.  In  figure  5  the  weights  and 
levels  have  already  been  determined. 

2.  Construct a priority list of the tasks. Under the
CP/MISF rules, the tasks are sorted in descending
order of levels. If two nodes have the same level then
the task with the most immediate successors has a higher
priority. Ties are broken according to lexicographic
ordering. Thus we sort the tasks based on the triad
[level | number of successors | lexicographic order].

3.  Perform list scheduling on the priority list.
Whenever any processor becomes available, it scans
the priority list from left to right and picks up the
first task which is ready to be executed (i.e. all its
predecessors have completed). A task which has been
picked up for execution is marked as taken to avoid
picking it up again.
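
The three stages can be sketched as a small event-driven simulation. This is an illustrative reading of the description above, not Kasahara and Narita's implementation; the names `compute_levels` and `cp_misf` and the dictionary-based DAG representation are assumptions. Communication costs are taken to be zero, and every task, including terminal ones, must have an entry in `succ`.

```python
import heapq


def compute_levels(weight, succ):
    """Level = longest weighted path from a task to a terminal task."""
    level = {}

    def visit(v):
        if v not in level:
            level[v] = weight[v] + max((visit(s) for s in succ[v]), default=0)
        return level[v]

    for v in weight:
        visit(v)
    return level


def cp_misf(weight, succ, p):
    """List-schedule a DAG on p processors with zero communication cost.

    Returns (parallel time, {task: (start time, processor)}).
    """
    level = compute_levels(weight, succ)
    # Priority list: [level | number of successors | lexicographic order].
    order = sorted(weight, key=lambda v: (-level[v], -len(succ[v]), v))
    waiting = {v: 0 for v in weight}          # unfinished predecessors
    for v in weight:
        for s in succ[v]:
            waiting[s] += 1
    idle = list(range(p))
    running = []                              # heap of (finish, task, proc)
    taken, start, now = set(), {}, 0
    while len(taken) < len(order) or running:
        for v in order:                       # scan the list left to right
            if not idle:
                break
            if v not in taken and waiting[v] == 0:
                proc = idle.pop()
                taken.add(v)
                start[v] = (now, proc)
                heapq.heappush(running, (now + weight[v], v, proc))
        now = running[0][0]                   # advance to the next completion
        while running and running[0][0] == now:
            _, v, proc = heapq.heappop(running)
            idle.append(proc)
            for s in succ[v]:
                waiting[s] -= 1
    return now, start
```

For example, with tasks a and b both feeding c, and c feeding d, with weights 2, 3, 1 and 2, the priority list is b, a, c, d; two processors finish at the critical path length 6 and one processor needs the full sequential time 8.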

Performance 

To  measure  the  performance  of  a  scheduling  we 
define  R,  the  ratio  of  goodness 


In  figure  6,  we  plot  the  ratio  of  goodness,  R,  for 
the  CP/MISF  method  assuming  that  communication 
costs  are  zero. 

The  performance  of  the  method  is  indeed  remark¬ 
able.  It  is  within  1%  of  the  lower  bound,  which  leads 
us  to  ask  the  following  question: 

Is  CP/MISF  an  asymptotically  optimal  method 
for  the  above  problem? 

It  is  obvious  that  for  a  realistic  message  passing  ar¬ 
chitecture  with  non-zero  communication  costs,  CP/MISF 


Figure  6:  The  ratio  R  for  CP/MISF  assuming  Tc  =  0 
for  n  =  240. 

will perform poorly because of its unacceptably high
communication requirements. Even for shared memory
architectures, say with a bus and local memory,
its performance could deteriorate because of high data
movement.

5  A  scheduling  methodology 

The scheduling problem for message passing architectures
which include communication cost is very difficult.
We propose the following four step heuristic
approach:

1.  Clustering: 

Find a “good” schedule for an unbounded number
of virtual processors connected as a clique. Because of
the  existence  of  communication  cost,  this  stage  will 
generate  clusters  of  tasks  that  must  be  executed  by  the 
same  processor  on  the  target  architecture,  Sarkar  [14]. 
The  locality  assumption,  Gerasoulis  and  Nelken  [5], 
which  specifies  that  a  processor  may  modify  only  the 
data  which  is  stored  in  that  processor,  automatically 
determines  the  clustering.  For  our  example,  we  obtain 
the  following  clusters: 

$$M_{2n-1} = (T_n^{2n-1}, T_{n+1}^{2n-1}, \ldots, T_{2n-1}^{2n-1}).$$

2.  Physical  mapping: 

The  2n  —  1  clusters  must  be  mapped  to  the  p  phys¬ 
ical  processors.  Each  processor  will  be  assigned  the 
tasks  of  several  clusters  in  an  attempt  to  load  balance 


Figure  7:  Work  profile  for  n  =  100. 

the arithmetic work. With each cluster M(j), we associate
a work load W(j) where

W(j) = arithmetic work in M(j).

The  work-profile  of  the  clusters,  George  et  al.  [4],  is  a 
graph  of  W(j)  against  j,  see  figure  7. 

It can be seen from the work profile that the wrap
mapping completely load balances the arithmetic
if 2n - 1 is a multiple of p. The
wrap mapping also conforms to the proximity assumption,
see Nelken [12], which further reduces communication
costs.
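
The effect of a mapping on the arithmetic load can be explored with a few lines of code. The sketch below assumes the usual definitions: wrap mapping sends cluster j to processor (j - 1) mod p, and block mapping assigns contiguous groups of clusters. The function names and the work values in the usage note are made up for illustration; they are not the work profile of figure 7.

```python
def wrap_mapping(num_clusters, p):
    """Cluster j (1-based) goes to processor (j - 1) mod p."""
    return {j: (j - 1) % p for j in range(1, num_clusters + 1)}


def block_mapping(num_clusters, p):
    """Contiguous groups of ceil(num_clusters / p) clusters per processor."""
    size = -(-num_clusters // p)              # ceiling division
    return {j: (j - 1) // size for j in range(1, num_clusters + 1)}


def processor_loads(work, mapping, p):
    """Sum the work W(j) of the clusters mapped to each processor."""
    loads = [0] * p
    for j, w in work.items():
        loads[mapping[j]] += w
    return loads
```

With a hypothetical increasing profile W(j) = j for six clusters and p = 3, wrap mapping gives loads 5, 7, 9 while block mapping gives 3, 7, 11, so the most heavily loaded processor does less work under wrap mapping.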

3.  Storage  of  data: 

To  reduce  communication  further  the  data  items 
that  are  accessed  most  by  the  tasks  in  each  processor 
are  stored  in  that  processor.  All  data  items  must  be 
stored  in  the  processors  before  execution  begins.  Our 
algorithm has four matrices to be considered: A, B, C
and X. Since matrix X overwrites C, we will only
consider the three matrices A, B and C.

We will postpone dealing with the storage of A
and B until the next section on ordering. As for the
matrix C, we associate data items (i.e. diagonals) of
matrix C with the clusters. Each cluster M(j) is associated
with the data item which it accesses most,
which is the j’th diagonal of C. The locality assumption
implies the definition of M(j) and that data item
j should be stored in the same processor which executes
cluster M(j). The data mapping of matrix C
is obvious. Since we have used wrap mapping of clusters
to processors, we also use wrap mapping of the


Figure  8:  N-cp/misf  vs.  compute-ahead  for  AXXBC 
with  wrap  mapping  and  n  =  240. 

diagonals  of  C  to  the  processors. 

4.  Task  ordering: 

In  this  stage,  the  tasks  assigned  to  each  processor 
are ordered and execution threads are formed. In a
static scheduler, these threads are determined at compile
time. In the next section, we will describe the
N-cp/misf ordering methods and compare them with
compute-ahead.

6  The  N-cp/misf  methods 

In this section we define the N-cp/misf methods. We
impose the memory constraint and assume that $a_i$ diagonals
are mapped to each processor $p_i$ and that the
same processor has enough space to store an additional
$b_i$ diagonals. For simplicity, assume that $a_i = (2n-1)/p$
and $b_i = N$ for $0 \le i \le p-1$. Thus N is the
number of additional data items that can be stored in
each processor.

Our  approach  is  to  form  p  priority  lists  by  sorting 
the  tasks  assigned  to  each  processor.  The  tasks  as¬ 
signed  to  each  processor  are  sorted  according  to  the 
CP/MISF  criteria,  see  section  4.2.  We  then  use  a  mod¬ 
ified  list  scheduler  (MLS)  whose  input  are  the  p  pri¬ 
ority  lists  and  the  parameter  N,  and  whose  output 
is  a  scheduling  which  satisfies  the  DAG  dependencies, 
locality  assumption  and  memory  constraint. 

The  N-cp/misf  methods  are  derived  as  follows: 

1.  Computation of levels: Find the levels for the
DAG as in the traditional CP/MISF method.

2.  Sorting: The p groups of tasks mapped in each
processor are sorted according to the CP/MISF
criteria. Tasks with higher levels are placed in
front of tasks with lower levels. If several tasks
have the same level, then they are sorted according
to the number of outgoing edges. If several
tasks have the same level and the same number
of outgoing edges, they are sorted lexicographically.
Thus our sort is again based on the triad
[level | number of successors | lexicographic order]
in each processor. These are the p priority
lists.

3.  Modified  list  scheduler:  In  the  MLS  each 
available  processor  scans  its  priority  list  from  left 
to  right  and  executes  the  first  task  that  satisfies 
the  following  conditions: 

•  The  task  is  ready  to  be  executed. 

•  The  execution  of  the  task  will  not  result  in 
a  scheduling  that  requires  space  for  more 
than  N  data  items. 

The  first  task  in  the  priority  list  which  satisfies 
both  constraints  is  scheduled  to  the  processor.  If 
no  such  task  exists,  the  processor  remains  idle. 

The MLS could deadlock if a task is ready but its
execution requires space for N+1 data items and a descendant that
could break the deadlock depends on this task. The
deadlock can be broken by re-receiving the same data.
We have not observed this situation in our DAGs and
conjecture that it does not occur.

Note that there is a range of N-cp/misf methods.
For the 1-cp/misf method each processor needs at the
most space for 1 additional data item to execute. For
$N = (2n-1)/p$ the space requirements for each processor are
doubled. At the extreme is the “inf-cp/misf” method
which assumes that each processor has infinite memory.

In figure 8, we compare the N-cp/misf methods
with the compute-ahead ordering in terms of $T_a + T_i$ for
both approaches. We plot the quantity $1 - T(N)/T(ca)$
where T(N) is the $T_a + T_i$ time for the N-cp/misf
method and T(ca) is the corresponding time for
compute-ahead. The graph represents the savings achieved by
the N-cp/misf method. 1-cp/misf has the same performance
as compute-ahead but “inf-cp/misf” is up to
14% faster.

7  Comparing  with  SYLV_DIAG 

Kagstrom’s algorithms have been implemented on
Dunigan’s [2] message passing multiprocessor simulator
(PPSIM). We have also implemented the naive and
compute-ahead algorithms on PPSIM. All implementations are
in single precision floating point C. The experimental
results are for the case m = n = 64. Kagstrom’s
implementations, as well as our own, use initial simulator
values which correspond to the Intel iPSC cube.

cube.init(0.1,0.3,0.2,1024);

The speedup of compute-ahead has been plotted in
figure 9. It should be compared with figure 4.a of [10].
The speedup of SYLV_DIAG has been reproduced in
figure 9 from data given to us by Kagstrom [9] for
comparison purposes. There are two versions of the
compute-ahead program:

•  The first version, compute-ahead with local copies
of A and B, is known as the store approach:
copies of all the diagonals of A and B are stored
in each processor.

•  The second version, compute-ahead with communication
of A and B, is known as the mix
approach: only copies of the main diagonals of
A and B are stored in each processor while all
other diagonals are communicated as required.

Note  the  time  space  trade-off  between  the  two  ver¬ 
sions  of  the  compute-ahead  program.  The  store  ver¬ 
sion  exhibits  a  better  speedup  but  has  higher  local 
memory  requirements.  The  mix  version  of  the  pro¬ 
gram  exhibits  a  poor  speedup  due  to  the  fact  that 
diagonals  of  A  and  B  are  also  communicated  rather 
than  just  diagonals  of  C. 

It  should  be  noted  that  in  Dunigan’s  PPSIM  sim¬ 
ulator  [2]  the  communication  delay  of  a  message  of 
length  M  sent  across  h  hops  is 

$$sM + hrM$$




Figure 9: The speedups of compute-ahead and
SYLV_DIAG, m = n = 64.

where  s  is  the  startup  delay  value,  r  is  the  commu¬ 
nication  delay  for  a  floating  point  number  and  M  is 
rounded  up  to  the  nearest  packet  size.  In  the  compute- 
ahead  algorithm,  h  is  always  1  since  only  nearest  neigh¬ 
bor  communication  is  used.  In  this  case,  the  PPSIM 
delay  is  (s  +  r)M. 

On real hypercubes, such as the NCUBE, the delay
for a message of size M communicated between neighbors
is given by $\alpha + \beta M$, see Dunigan [3], with an additional
startup factor of $\alpha$. Usually $\alpha \gg \beta$ and therefore it
may be that on a real machine, the penalty for sending
short messages, as is done in Kagstrom’s algorithms,
would be worse than it is in the simulator.
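
The difference between the two delay models can be made concrete with a short calculation. The sketch below is illustrative only: the function names, the `packet` parameter used for rounding, and all numeric values are assumptions and are not taken from PPSIM or from measurements on any machine.

```python
import math


def ppsim_delay(M, h, s, r, packet=1):
    """Simulator delay s*M + h*r*M, with M rounded up to a multiple of `packet`."""
    M = math.ceil(M / packet) * packet
    return s * M + h * r * M


def hypercube_delay(M, alpha, beta):
    """Nearest-neighbor delay alpha + beta*M on a real hypercube."""
    return alpha + beta * M


def total_delay(delay, n_items, msg_len):
    """Delay of sending n_items numbers in messages of length msg_len."""
    n_msgs = math.ceil(n_items / msg_len)
    return n_msgs * delay(msg_len)
```

With the hypothetical values s = 0.5, r = 0.25, h = 1 and a packet size of 1, sending 64 numbers costs 48 time units in the simulator model whether they travel as 64 messages of length 1 or as one message of length 64. With alpha = 100 and beta = 1, the same traffic costs 6464 units as 64 short messages but only 164 units as a single long message, which illustrates why a startup term penalizes algorithms that send many short messages.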

It  should  be  pointed  out  that  the  results  in  figure  9 
are  for  a  small  case,  i.e.  n  =  64.  If  n  and  p  are  large 
the  results  may  be  different  since  the  diagonal  program 
requires  overhead  which  is  of  low  order.  Subroutine 
Tl ,  for  example,  has  index  arithmetic  which  is  done 
once  each  time  it  is  called.  If  n  is  large,  the  effect  of 
this  overhead  will  be  less  noticeable. 

8  Further  research 

Consider  the  N-cp/misf  orderings.  Unless  we  use  the 
store  version  of  the  program,  we  would  have  to  com¬ 
municate  the  diagonals  of  A  and  B.  The  communi¬ 
cation  strategy  would  have  to  be  modified  for  these 
orderings. For example, in order to determine which
diagonals of A and B will be needed by each processor
when using the “inf-cp/misf” ordering method,
one needs to know the scheduling explicitly. Further,


the  communication  strategy  used  for  the  naive  and 
compute-ahead  orderings,  where  each  processor  sends 
all  of  its  diagonals  to  its  neighbor,  might  have  to  be 
changed. 

The  low  speedups  of  the  diagonal  algorithms  are 
due  in  part  to  the  fact  that  the  partitioned  graph, 
which  appears  in  figure  5,  has  less  potential  paral¬ 
lelism  than  that  of  the  non-convex  partitioning  used 
by SYLV_DIAG. This is a partial response to Sarkar’s
question [14] about the restrictiveness of the macro-dataflow
model. In this case, the convex partitioning
does not exhibit enough parallelism and it might be
better to use a non-convex one. More accurate analysis
and actual machine tests are needed to determine
which of the alternative approaches is better.

Abstract

One  drawback  of  the  cyclic  one-sided  Jacobi  algo¬ 
rithm  is  the  necessity  of  computing  the  inner  product 
of  each  column  pair  to  be  rotated — indeed,  just  to 
check  whether  the  pair  needs  to  be  rotated.  How¬ 
ever,  if  the  orthogonality  of  a  column  pair  known 
to  be  orthogonal  during  a  sweep  is  not  subsequently 
destroyed  before  being  encountered  during  the  next 
sweep,  then  the  inner  product  computation  in  the  lat¬ 
ter  sweep  is  unnecessary.  The  number  of  such  in¬ 
ner  products  becomes  significant  as  the  process  nears 
convergence.  To  avoid  these  unnecessary  inner  prod¬ 
ucts,  the  usual  algorithm  is  extended  to  include  a 
data  structure  which  keeps  a  record  of  the  current 
orthogonality  status  of  the  column  pairs. 

In  the  parallel  setting,  the  reduction  during  the  lat¬ 
ter  sweeps  in  both  the  number  of  column  pairs  ro¬ 
tated  and  the  number  of  inner  product  computations 
in  a  sweep  will  not  generally  be  uniformly  distributed 
across  the  processors.  Because  of  the  resulting  loss  of 
load  balance,  the  actual  improvement  in  runtimes  will 
be  less  than  one  would  otherwise  expect.  A  statistical 
analysis  of  this  phenomenon  is  given. 

The  performance  of  the  enhanced  algorithm  is  ob¬ 
served  through  implementation  on  the  128-node  Intel 
iPSC/860  at  the  Oak  Ridge  National  Laboratory.  A 
discussion  of  the  correlation  of  these  results  with  the 
statistical analysis mentioned above is also given.

tract DE-AC05-760R00033.


1  Introduction 

It  was  observed  some  time  ago  by  Sameh  [12]  and 
Luk  [7]  that  the  one-sided  Jacobi  algorithm  was  well- 
suited  for  singular  value  and  eigenvalue  computation 
in  a  multiprocessor  environment.  More  recently  it  has 
been  observed  [1,4,5,6,10]  that  the  one-sided  Jacobi 
algorithm  is  especially  naturally  suited  for  such  com¬ 
putations  on  distributed  memory  and  vector  archi¬ 
tectures.  In  the  distributed  memory  environment,  its 
natural  parallelism  permits  excellent  load  balancing 
and  the  required  message  passing  is  relatively  small. 
Because  of  its  richness  in  vector  operations,  it  is  nat¬ 
urally  suited  for  vector  architectures. 

This  has  led  to  a  renewed  interest  in  the  one-sided 
Jacobi algorithm. Berry and Sameh [1] have effectively
implemented it in a multiprocessor environment.
Eberlein [4,5] and Eberlein and Park [6,8] have
done  so  on  distributed  memory  machines.  Rath  [9] 
and  de  Rijk  [10]  have  given  a  fast  Givens-type  vari¬ 
ant  of  the  one-sided  Jacobi  algorithm  which  reduces 
the operation count and permits more effective
vectorization.

The  Jacobi  algorithms  have  other  advantages.  It 
has  recently  been  shown  by  Demmel  and  Veselic  [.‘1] 
that the Jacobi algorithms compute small eigenvalues
and singular values and small components of eigenvectors
and singular vectors with higher relative accuracy
than either the QR and Golub-Reinsch algorithms
or the divide and conquer method.

One drawback of the cyclic one-sided Jacobi algorithm is the necessity of computing the inner product of each column pair to be rotated; indeed, the inner product is needed just to check whether the pair needs to be rotated. However, if the orthogonality of a column pair known to be orthogonal during a sweep is not subsequently destroyed before being encountered during the next sweep, then the inner product computation in the latter sweep is unnecessary. The number of such inner products becomes significant as the process nears convergence.

We  extend  the  usual  algorithm  so  that  most  of 
these  unnecessary  inner  products  are  avoided.  This 
is  done  by  use  of  a  data  structure  which  keeps  a 
record  of  the  current  orthogonality  status  of  the  col¬ 
umn  pairs. 

During  the  latter  sweeps  of  the  one-sided  Jacobi 
process,  there  is  a  reduction  in  both  the  number  of 
column  pairs  which  must  be  rotated  and,  as  noted 
above,  the  number  of  necessary  inner  products.  In 
the  parallel  setting,  these  reductions  will  not  gener¬ 
ally  be  uniformly  distributed  across  the  processors. 
Because  of  the  resulting  loss  of  load  balance,  the  ac¬ 
tual  improvement  in  runtimes  will  be  less  than  one 
would  otherwise  expect.  A  statistical  analysis  of  this 
phenomenon  is  given. 

The  performance  of  the  enhanced  algorithm  is  ob¬ 
served  through  implementation  on  the  128-node  In¬ 
tel  iPSC/860  at  the  Oak  Ridge  National  Laboratory. 
An  analysis  of  the  correlation  of  these  results  with 
the  statistical  analysis  mentioned  above  is  also  given. 
The  results  indicate  that  one  can  expect  about  a  10- 
20%  improvement  in  performance  for  a  small  number 
of  processors,  with  this  improvement  degrading  as  the 
number  of  processors  increases  due  to  the  statistical 
loss  of  load  balance  mentioned  above. 

2  The One-Sided Jacobi Algorithm

The singular value decomposition of a real $m \times n$ matrix $A$, $m \ge n$, may be given by

$$A = U \Sigma V^T$$

where

$$U^T U = I_n = V^T V \quad\text{and}\quad \Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$$

with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0$. The diagonal entries $\sigma_i$ of $\Sigma$ are called the singular values of $A$. The columns of $U$ and $V$ are called the left and right singular vectors (respectively) associated with the respective singular values of $A$. Since $A A^T = U \Sigma \Sigma^T U^T$ and $A^T A = V \Sigma^T \Sigma V^T$, these singular vectors are orthonormalized eigenvectors associated with eigenvalues of $A A^T$ and $A^T A$, respectively, and the singular values are the square roots of the eigenvalues of $A^T A$.
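As a small illustration of the last statement, consider the following worked example. The matrix is chosen here purely for convenience and does not appear in the original text.

$$A = \begin{pmatrix} 3 & 0 \\ 4 & 5 \end{pmatrix}, \qquad A^T A = \begin{pmatrix} 25 & 20 \\ 20 & 25 \end{pmatrix}.$$

The eigenvalues of $A^T A$ are $25 \pm 20$, that is, $45$ and $5$, so $\sigma_1 = \sqrt{45} = 3\sqrt{5} \approx 6.708$ and $\sigma_2 = \sqrt{5} \approx 2.236$. As a consistency check, $\sigma_1 \sigma_2 = 15 = |\det A|$.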

The  one-sided  Jacobi  Algorithm  for  computing  the 
singular  values  proceeds  as  follows. 

Given $A \in \mathbb{R}^{m \times n}$ with $m \ge n$, one first computes an orthogonal matrix $V$ such that the columns of $\tilde{U} := AV$ are orthogonal. If one then sets $\Sigma := \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$, where $\sigma_i$ is the 2-norm of the $i$-th column of $\tilde{U}$, and scales the columns of $\tilde{U}$ by the $\sigma_i$ to form $U$, then $\tilde{U} = U \Sigma$ and hence

$$A = U \Sigma V^T \quad\text{and}\quad U^T U = I_n.$$

Thus, the SVD is obtained if such a matrix $V$ can be computed.

The  matrix  V  can  be  computed  iteratively  as  a 
product  of  planar  rotation  matrices,  each  of  which 
rotates  a  pair  of  columns  of  A,  as  follows. 

Given an ordered pair $[a_i, a_j]$ of columns from $A$, a plane rotation can be determined such that if

$$[a_i^*, a_j^*] := [a_i, a_j] \begin{pmatrix} c & -s \\ s & c \end{pmatrix},$$

then $a_i^*$ and $a_j^*$ are orthogonal and $\|a_i^*\| \ge \|a_j^*\|$. (Here and in the sequel, $\|\cdot\|$ denotes the two-norm.) An algorithm for rotating a pair of columns $a_i$ and $a_j$ of $A$ is the following. One must first select a tolerance $\epsilon$ to use in the test for orthogonality.

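Algorithm Rotate

A. If $|a_i^T a_j| / (\|a_i\| \, \|a_j\|) < \epsilon$, then $a_i$ and $a_j$ are taken to be already orthogonal so

set $a_i^* := a_i$ and $a_j^* := a_j$ provided $\|a_i\| \ge \|a_j\|$;
set $a_i^* := a_j$ and $a_j^* := a_i$ otherwise.

B. If $|a_i^T a_j| / (\|a_i\| \, \|a_j\|) \ge \epsilon$, then

1. Compute $c$ and $t$:

Compute $t := |\tau| + \sqrt{1 + \tau^2}$,
where $\tau := (\|a_i\|^2 - \|a_j\|^2) / (2 a_i^T a_j)$.

If $\|a_i\| \ge \|a_j\|$, set $t := 1/t$ if $\tau > 0$ and $t := -1/t$ if $\tau < 0$;
otherwise, set $t := -t$ if $\tau > 0$.
Compute $c := 1/\sqrt{1 + t^2}$.

2. Rotate $a_i$ and $a_j$:
$a_i^* := c(a_i + t a_j)$
$a_j^* := c(a_j - t a_i)$.

3. Update $\|a_i\|^2$ and $\|a_j\|^2$:¹
$\|a_i^*\|^2 := \|a_i\|^2 + t \, a_i^T a_j$
$\|a_j^*\|^2 := \|a_j\|^2 - t \, a_i^T a_j$.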

The one-sided Jacobi algorithm is performed in sweeps, each of which consists of rotations of each of the $N := n(n-1)/2$ possible pairs of columns performed in some fixed sequence with the order of each pair $[a_i, a_j]$ chosen to ensure a specified order $\|a_i^*\| \ge \|a_j^*\|$ of the result. The Jacobi algorithm is characterized by this order, called a Jacobi ordering.

¹Note that for $x, y \in \mathbb{R}^m$, if $u = c(x + ty)$ and $v = c(y - tx)$, then $u^T u = x^T x + t \, x^T y + t \, u^T v$ and $v^T v = y^T y - t \, x^T y - t \, u^T v$, where $c = \cos\theta$ and $t = \tan\theta$ are chosen as in Algorithm Rotate.

The sweeps are repeated until the columns are pairwise orthogonal as measured by $\epsilon$ in Algorithm Rotate. Pairwise orthogonality is usually determined by counting, for each sweep, the number of column pairs that were already orthogonal when encountered; when that number is $N$, convergence is declared (actually, $N - 1$ will suffice; see §5).

To rotate a column pair $[a_i, a_j]$ one needs to have available $\|a_i\|$, $\|a_j\|$, and the inner product $a_i^T a_j$. However, one need not recompute the column norms for each new pair. After initializing an n-vector to contain the squares of the norms of the columns, this vector can be easily updated after each rotation as in step B3 of Algorithm Rotate. When convergence is achieved, this vector will contain the squares of the singular values of $A$.
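To show how the sweeps, the convergence count, and the running vector of squared norms fit together, here is a hedged serial sketch. The function name `jacobi_singular_values`, the cyclic-by-rows pair order, the default tolerance, and the sweep limit are all illustrative assumptions, and the formula for `tau` is the same assumed reconstruction used for Algorithm Rotate.

```python
import math

def jacobi_singular_values(columns, eps=1e-12, max_sweeps=60):
    """Serial one-sided Jacobi on a matrix given as a list of columns."""
    cols = [list(map(float, c)) for c in columns]
    n = len(cols)
    sq = [sum(x * x for x in c) for c in cols]    # squared norms, computed once
    n_pairs = n * (n - 1) // 2
    for _ in range(max_sweeps):
        already_orthogonal = 0
        for i in range(n - 1):
            for j in range(i + 1, n):
                alpha = sum(x * y for x, y in zip(cols[i], cols[j]))
                scale = math.sqrt(max(sq[i] * sq[j], 0.0))
                if abs(alpha) <= eps * scale:
                    already_orthogonal += 1
                    if sq[i] < sq[j]:             # keep the larger norm first
                        cols[i], cols[j] = cols[j], cols[i]
                        sq[i], sq[j] = sq[j], sq[i]
                    continue
                tau = (sq[i] - sq[j]) / (2.0 * alpha)
                t = abs(tau) + math.sqrt(1.0 + tau * tau)
                if sq[i] >= sq[j]:
                    t = 1.0 / t
                t = math.copysign(t, alpha)
                c = 1.0 / math.sqrt(1.0 + t * t)
                ci, cj = cols[i], cols[j]
                cols[i] = [c * (x + t * y) for x, y in zip(ci, cj)]
                cols[j] = [c * (y - t * x) for x, y in zip(ci, cj)]
                sq[i] += t * alpha                # step B3: norms are updated,
                sq[j] -= t * alpha                # never recomputed
        if already_orthogonal == n_pairs:         # every pair was orthogonal
            break
    return [math.sqrt(max(s, 0.0)) for s in sq]
```

For example, `jacobi_singular_values([[3, 4], [0, 5]])` treats `(3, 4)` and `(0, 5)` as the two columns of the matrix.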

Unlike the column norms, the inner product $a_i^T a_j$ must be computed for each rotation. These inner products form a significant portion of the floating point operations in the one-sided Jacobi algorithm. It is the reduction of these inner product computations that we address in §4 and §5.

3  The  Parallel  Algorithm 

Rotations  of  pairs  of  columns  with  disjoint  index  pairs 
are  independent.  Therefore,  if  index  pairs  in  a  Jacobi 
ordering  are  disjoint,  the  corresponding  rotations  can 
be  performed  in  parallel. 

In the parallel Jacobi algorithm the rotations of a sweep are partitioned into groups of independent rotations with each group of rotations performed in parallel. There exist many parallel Jacobi orderings in which the maximum number $\lfloor n/2 \rfloor$ of rotations is performed in each of the $n - 1$ ($n$, if $n$ is odd) time steps. It is such optimal orderings that we use in our algorithm.
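One familiar way to obtain such an optimal ordering is the round-robin (circle) schedule sketched below. It is offered only as an example of an ordering with the stated properties; it is not claimed to be the ordering used by the authors, and the function name `round_robin_ordering` is hypothetical.

```python
def round_robin_ordering(n):
    """Return a list of time steps; each step is a list of disjoint index pairs.

    Uses the circle method: index 0 stays fixed and the others rotate.
    For odd n a dummy entry (None) is added, so one column idles per step.
    """
    players = list(range(n))
    if n % 2:
        players.append(None)
    m = len(players)
    steps = []
    for _ in range(m - 1):
        pairs = []
        for k in range(m // 2):
            a, b = players[k], players[m - 1 - k]
            if a is not None and b is not None:
                pairs.append((min(a, b), max(a, b)))
        steps.append(pairs)
        # Rotate every entry except the first by one position.
        players = [players[0]] + [players[-1]] + players[1:-1]
    return steps
```

For `n = 8` this gives 7 steps of 4 disjoint pairs each, covering all 28 pairs exactly once.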

Our interest is in implementation of the parallel algorithm on a ring-connected distributed-memory multiprocessor. The implementation we use follows that introduced by Eberlein in [5] and further discussed in Eberlein and Park [6].

The columns of the matrix are partitioned into 2p blocks of size as uniform as possible, where p is the number of processors. The algorithm assumes that 2p < n. The blocks are distributed to the processors in pairs as indicated in Figure 1 for p = 4. Each column is extended by one entry, which will contain the square of the two-norm of the column. After initialization, this entry is updated with each rotation as in Algorithm Rotate; upon convergence, these entries will contain the squares of the singular values.


Processor:    1     2     3     4
Block A:      A1    A2    A3    A4
Block B:      B1    B2    B3    B4

Figure 1: Distribution of columns (p = 4)

One begins each sweep by rotating all pairs of columns within each block in a lexicographically cyclic order. Then, after rotating each column of block A with each column of block B in each processor, the blocks are redistributed among the processors according to the communication pattern indicated in Figure 2 for p = 4. The columns of block A are again rotated with the columns of block B in each processor. The sweep is completed after 2p − 1 such “subsweeps.”


Figure 2: Communication pattern (panels: odd time steps; even time steps)

Note  that  under  this  communication  pattern,  all 
information  for  computing  the  rotation  parameters 
is  available  in  the  processor  containing  the  column 
pair  so  the  columns  can  be  rotated  locally.  Hence, 
the  amount  of  communication  is  small,  with  only  one 
send  required  in  each  subsweep,  or  2p  —  1  per  sweep. 
Furthermore,  the  load  balance  remains  exceptionally 
uniform. 

4  Pairwise  Decoupling 

For each column pair encountered during a sweep, one first computes the inner product of the columns and then, if they are not already orthogonal, rotates the columns. For an m × n matrix, each inner product costs m multiplications and the rotation about 4m multiplications.² In the latter sweeps, the inner product computation becomes dominant since many of the column pairs encountered are already orthogonal. A typical count of the number of orthogonal pairs encountered in a sweep is illustrated by “Orthogonal” in Table 1 for a random 128 × 128 matrix with p = 4.

²The count for a rotation is 2m if fast rotations are used (see [9,10]).


Table 1: n = m = 128, p = 4

Sweep    Orthogonal    Decoupled
  1           0             0
  2           0             0
  3           0             0
  4           0             0
  5           0             0
  6           1             0
  7        1313             0
  8        7252            98
  9        8074          3128
 10        8128          8069

If the orthogonality of a column pair known to be orthogonal during a given sweep is not destroyed before being encountered during the next sweep, then the inner product computation in the latter sweep is unnecessary. To avoid these redundant inner products, we extend the algorithm to include a data structure which keeps a record of the current orthogonality status of the column pairs. In the example given in Table 1, the column “Decoupled” gives the number of such inner products which could be avoided in this manner.

To extend the algorithm, each column is assigned a column number (this is unnecessary in the original algorithm) and extended to include an n-vector of flags. The vector of flags of a column remains with the column as it is sent to the various processors. Orthogonality of column i with column j is indicated by both the j-th flag of column i and the i-th flag of column j being set. Requiring both flags to be set permits one to avoid additional interprocessor communication.

The flag management is as indicated in the following extension of Algorithm Rotate.

Algorithm Rotate
(with pairwise decoupling)

A. If flags indicate the columns are orthogonal, do nothing. Otherwise,

B. If $|a_i^T a_j| / (\|a_i\| \, \|a_j\|) < \epsilon$, then

1. Interchange columns i and j if $\|a_i\| < \|a_j\|$.

2. Set the j-th flag of column i and the i-th flag of column j.

C. If $|a_i^T a_j| / (\|a_i\| \, \|a_j\|) \ge \epsilon$, then

1. Rotate columns i and j.

2. Set the j-th flag of column i and the i-th flag of column j.

3. Clear all the other flags of columns i and j.

Of course, there is some overhead associated with maintaining this data structure. Thus, its implementation should be delayed until that sweep where the savings in inner product computation exceeds the cost of this overhead. Our experience indicates that beginning implementation with the sweep following that in which “Orthogonal” becomes positive is appropriate. Clearly there is nothing to be gained by starting its implementation before this point.

For reasons given in §7, the actual improvement in runtimes due to the reduction during the latter sweeps of either the number of column pairs rotated or the number of inner product computations is likely to be less than the figures in Table 1 might suggest.

5  Subspace  Decoupling 

The reduction of inner products described in the preceding section can be further improved. Suppose H and K are disjoint collections of columns for which each column in H is orthogonal to each column in K. If two columns in H are rotated, then the orthogonality of the resulting columns with the columns in K is preserved since K is contained in the orthogonal complement of the subspace generated by H, and the rotated columns, which are linear combinations of the original columns, remain in this subspace.

In the preceding section it was assumed that when two columns are rotated, the orthogonality of the resulting columns with any other column could no longer be assured. However, in view of the subspace decoupling noted above, if a column is orthogonal to each of a pair of columns to be rotated, then its orthogonality with the updated pair is assured after rotation.

The  data  structure  of  the  preceding  section  can  be 
revised  to  reflect  this  preservation  of  orthogonality. 
The  following  further  revision  of  Algorithm  Rotate 
indicates  how  the  flags  can  be  managed  to  implement 
this  feature. 

Algorithm Rotate
(with subspace decoupling)

A. If flags indicate the columns are orthogonal, do nothing. Otherwise,

B. If $|a_i^T a_j| / (\|a_i\| \, \|a_j\|) < \epsilon$, then

1. Interchange columns i and j if $\|a_i\| < \|a_j\|$.

2. Set the j-th flag of column i and the i-th flag of column j.

C. If $|a_i^T a_j| / (\|a_i\| \, \|a_j\|) \ge \epsilon$, then

1. Rotate columns i and j.

2. Set the j-th flag of column i and the i-th flag of column j.

3. For each k other than i and j,

a) If either of the k-th flags of columns i and j is clear, clear both of them.

b) If both of the k-th flags of columns i and j are set,

(i) if both the i-th and j-th flags of column k are set, do nothing.

(ii) Otherwise, clear both the i-th and j-th flags of column k.
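Step C of the subspace-decoupled variant can be sketched as below. The function names are hypothetical, and the sketch assumes that the flag vectors of all columns are available in one shared list of lists, which is exactly what is not cheaply available on a distributed memory machine.

```python
def known_orthogonal(flags, i, j):
    """A pair counts as orthogonal only if both of its flags are set."""
    return flags[i][j] and flags[j][i]

def record_rotation_subspace(flags, i, j):
    """Steps C2 and C3 of Algorithm Rotate with subspace decoupling."""
    flags[i][j] = True
    flags[j][i] = True
    for k in range(len(flags)):
        if k == i or k == j:
            continue
        if not (flags[i][k] and flags[j][k]):
            # 3a: column k is not known orthogonal to both i and j.
            flags[i][k] = False
            flags[j][k] = False
        elif not (flags[k][i] and flags[k][j]):
            # 3b(ii): column k's own record is incomplete, so clear it.
            flags[k][i] = False
            flags[k][j] = False
        # 3b(i): all four flags set, orthogonality to k is preserved.
```

Compared with pairwise decoupling, a column that was orthogonal to both rotated columns keeps its status, so its inner products with them can still be skipped in the next sweep.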


Implementation of this enhancement in a distributed memory environment appears to be problematic; it requires either significant additional interprocessor communication or, within the existing communication pattern, a very large message size. It may be more suitable for other environments. The authors plan to explore the feasibility of the use of subspace decoupling in a subsequent paper.

One interesting consequence of subspace decoupling is that when N − 1 (rather than N) of the column pairs encountered during a sweep are already orthogonal, then convergence can be declared. For if all except one pair is orthogonal, when that pair is rotated to become orthogonal, its orthogonality with all other columns is preserved because of subspace decoupling. Hence, upon completion of the sweep, all N pairs are orthogonal.


6  Ordering the Singular Values

As noted in §2, when one rotates the column pair $[a_i, a_j]$ to produce the orthogonal column pair $[a_i^*, a_j^*]$, the angle of rotation can be chosen to ensure that $\|a_i^*\| \ge \|a_j^*\|$. Hence, for some fixed ordering of the columns, the rotations can always be chosen in the one-sided Jacobi algorithm so that $\|a_i^*\| \ge \|a_j^*\|$ whenever $i < j$. It has been suggested that the rotation angles be chosen in this manner to cause the computed singular values to be ordered.

This choice of rotation angles does, in fact, cause the singular values, upon convergence, to be ordered. Since we are aware of no proof of this result in the literature, we supply a proof in [11].

We note, however, that showing that the computed singular values are ordered is complicated by the fact that, prior to convergence, the column norms need not be ordered. In particular, when multiple (or nearly equal) singular values are present, this complication is evident. As an illustration, consider the matrix


1

-2

1


-2  1


1

1


If a sweep on A is performed by rotating the column pairs in the order (1, 2), (1, 3), (2, 3), then before the sweep the column norms are 3, $\sqrt{6}$, $\sqrt{6}$ and afterwards they are 3, $\sqrt{11}$, 1.

We now turn to a description of how to ensure the appropriate choice of rotation angles in the parallel algorithm with the communication pattern given in §3. This choice of rotation angles is incorporated into our algorithm.

We assume a fixed order of the columns of the matrix in terms of the initial distribution of blocks to the processors as follows: within blocks, the columns have their natural order; between blocks, the columns have the order given by the ordering

$$A_1, B_1, A_2, B_2, A_3, B_3, \ldots, A_p, B_p$$

of the blocks, where the blocks are initially distributed to the processors as indicated in the top half of Figure 3.

We first note that under the communication pattern of §3, after their initial distribution to the processors, the blocks of columns do not return to their original location until after two sweeps. The location of the blocks after each sweep is illustrated in Figure 3 for p = 4.

Processor:      1     2     3     4
Upper block:    A1    A2    A3    A4
Lower block:    B1    B2    B3    B4
After even-numbered sweeps

Processor:      1     2     3     4
Upper block:    A1    B4    B3    B2
Lower block:    B1    A4    A3    A2
After odd-numbered sweeps

Figure 3: Location of blocks

One must therefore use a different choice of rotations during alternate sweeps.

The choice of rotation angles which will ensure ordering of the computed singular values is as follows. Within blocks, the norm of the column of least index should, of course, be maximized. When rotating between blocks, the norms of the columns in the “upper” block of processor 1 should be maximized during every sweep; in the other processors, the norms of the columns in the “upper” block should be maximized during the even-numbered sweeps and those in the “lower” block should be maximized during the odd-numbered sweeps.

7  Statistical Loss of Load Balance

The  most  attractive  feature  of  the  one-sided  Jacobi 
algorithm  for  computing  the  singular  value  decom¬ 
position  on  a  parallel  computer  is  the  near-perfect 
load  balance  that  is  maintained  across  the  processors 
during  the  computation.  This,  coupled  with  the  rela¬ 
tively  high  ratio  of  computation  to  communication  in 
the  algorithm,  allows  the  parallel  algorithm  to  attain 
nearly  perfect  speedup  over  its  serial  counterpart. 

However,  as  noted  in  Table  1  of  §4  (see  “Orthog¬ 
onal”),  as  the  process  nears  convergence  an  increas¬ 
ing  number  of  rotations  become  unnecessary.  Since 
these  avoided  rotations  need  not  be  distributed  uni¬ 
formly  among  the  processors,  the  total  execution  time 
of  the  parallel  one-sided  Jacobi  algorithms  depends 
upon  how  much  the  load  balance  has  been  perturbed. 
In  the  worst  case,  p—  1  of  the  processors  have  no  rota¬ 
tions  to  perform  during  a  sweep,  while  the  remaining 
processor  must  perform  a  complete  set  of  rotations. 
Despite  a  drastic  reduction  in  the  total  work  done,  no 
decrease  in  parallel  execution  time  would  be  realized. 
Conversely,  in  the  best  case,  all  the  processors  share
equally  in  the  reduction  in  computation,  thereby  re¬ 
ducing  the  total  execution  time  as  well.  In  general, 
the  amount  of  reduction  will  be  somewhere  between 
these  extremes.  Similar  remarks  apply  to  the  inner 
products  avoided  in  the  decoupled  algorithms  (see 
“Decoupled”  in  Table  1).  Thus,  eliminating  either 
the  unnecessary  rotations  or  the  unnecessary  inner 
products  may  adversely  affect  the  load  balance,  miti¬ 
gating  any  gains  that  might  be  obtained  by  reducing 
the  computation.  This  section  presents  a  statistical 
analysis  of  this  tradeoff. 

Clearly, the rotations of pairs of columns do not form independent events. However, determining the (highly complex) correlation among the pairs of columns is problem dependent, and therefore beyond the scope of the model. Hence, we make the simplifying assumption that, given the total number of rotations performed during a sweep, each pair of columns will need rotation with equal probability. That is, if there are 8128 total pairs of columns, of which 4000 are actually rotated during a sweep, then in the model, the probability of any pair of columns requiring rotation is 4000/8128. Comparison of the model results with the empirical results indicates that this simplification is not too extreme. We make a similar assumption regarding the probability that an inner product will need to be performed in the pairwise decoupled algorithm.

We note that our assumption is not the same as assuming that the number of rotations (resp., inner products) performed during a sweep is evenly distributed among the processors. This would be the optimum distribution (as noted above), but is unlikely to occur in practice. However, even with the above simplification, a closed-form expression for the execution time, while obtainable in principle, is too costly to evaluate. Hence, we resort to a Monte-Carlo simulation of the behavior of the algorithm. The simulation takes three input parameters: n, the number of columns in the matrix A; p, the number of processors used; and c, the number of pairs of columns that were detected as being already orthogonal (resp., the number of unneeded inner products). The output of the simulation is the expected number of parallel rotations (resp., inner products) that were avoided during the algorithm.

One  iteration  of  the  Monte-Carlo  simulation  is  done 
by  constructing  a  p  x  (N/p)  array,  where  N  is  the 
total  number  of  pairs  of  columns  in  a  sweep.  Each 
row  of  the  array  corresponds  to  the  pairs  of  columns 
occurring  in  one  of  the  processors  during  the  sweep. 
The  columns  of  this  array  are  partitioned  into  blocks, 
representing  the  subsweep  boundaries,  when  the  pro¬ 
cessors  must  communicate.  Initially,  the  entire  array 
is  filled  with  zeroes.  If  during  the  given  sweep,  a  total 
of  c  of  the  N  rotations  (resp.,  inner  products)  were 
unnecessary,  then  c  random  locations  in  the  array  are 
set  to  one,  representing  the  pairs  that  do  not  need 
rotation  (resp.,  inner  product).  From  this  table,  it 
is  easy  to  calculate  how  much  parallel  work  has  been 
saved  during  that  sweep.  Note  that  simply  adding  up 
the  number  of  ones  in  each  row  is  not  sufficient,  since 
the  communication  done  between  neighboring  proces¬ 
sors  after  each  subsweep  affects  the  total  savings,  and 
this  must  be  taken  into  account. 
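One iteration of this simulation can be sketched as follows. The sketch makes several simplifying assumptions that are not in the original: the subsweep blocks are taken to be of (nearly) equal width, the number of blocks is passed in as `n_blocks`, and the average is taken over a fixed number of trials rather than with the adaptive stopping rule described below. The function names are hypothetical. Within a subsweep the processors wait for the slowest one, so the parallel saving in a block is the smallest number of skipped pairs in any row.

```python
import random

def parallel_savings(p, per_proc, n_blocks, c, rng):
    """One Monte-Carlo trial: parallel work saved in a sweep.

    p         number of processors (rows of the array)
    per_proc  pairs handled per processor in a sweep (N/p)
    n_blocks  number of subsweep blocks (assumed equal width)
    c         total number of pairs that can be skipped
    """
    total = p * per_proc
    skipped = [[0] * per_proc for _ in range(p)]
    for loc in rng.sample(range(total), c):       # c distinct random cells
        skipped[loc // per_proc][loc % per_proc] = 1
    bounds = [per_proc * b // n_blocks for b in range(n_blocks + 1)]
    saved = 0
    for lo, hi in zip(bounds, bounds[1:]):
        # All processors synchronize at the block boundary, so only the
        # minimum number of skipped pairs over the rows is a real saving.
        saved += min(sum(row[lo:hi]) for row in skipped)
    return saved

def expected_savings(p, per_proc, n_blocks, c, trials=200, seed=0):
    """Average of parallel_savings over a fixed number of trials."""
    rng = random.Random(seed)
    runs = [parallel_savings(p, per_proc, n_blocks, c, rng)
            for _ in range(trials)]
    return sum(runs) / trials
```

With a single processor every skipped pair is a real saving, so `parallel_savings(1, per_proc, n_blocks, c, rng)` returns `c`; with more processors the result can never exceed `c / p`.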

A series of simulations was carried out for a wide range of input parameters, and a running calculation of the mean and standard deviation of the results was made. The assumption is that the ratio of the mean to the standard deviation will asymptotically have a normal distribution. After an initial series of 30 trials (to ensure that the number of samples was large enough for the asymptotic assumption to hold), the model was terminated when the standard deviation, normalized by the appropriate factor to obtain a 95% confidence level, was under 0.25. Since we are interested in the number of rotations (resp., inner products) that are eliminated, the nearest integer is sufficient, permitting a relatively large tolerance. Figure 4 displays a typical graph of results for hypercubes of dimension 0 through 7, using 1024 columns and values of c that yield $c/N \approx 0.1k$ for $k = 1, 2, \ldots, 9$. Note that we compute the number of rotations (resp., inner products) avoided, so the length of the columns (i.e., number of rows m in the matrix) is not needed. However, the savings in execution time depends directly on m.

For convenience, we define r := c/N. The graph shown in Figure 4 displays the ratio of the number of parallel rotations avoided to the total number of rotations per processor, for several values of r. We note that the nature of the simulation (counting numbers rather than time) means that the same graph displays the ratio of the number of parallel inner products avoided to the total number of inner products per processor, where c now represents the total number of inner products, rather than rotations, that were avoided. From the graphs, we can see that as the number of processors increases, the percentage of the total possible savings that is actually obtained decreases. This reflects the fact that for larger numbers of processors, the potential for load imbalance is greater.

Figure 4 seems to indicate that the rapidity of the degradation as the number of processors increases is largely independent of r. However, Figure 4 fails to take the magnitude of the total savings into account, only the parallel savings. Let $s(r,p)$ represent the parallel savings on p processors for a given value of r. Then Figure 4 plots $s(r,p)/(N/p)$ against $\log_2 p$. Figure 5 uses the same data as Figure 4, but now the ordinate is $p \times s(r,p)/c$. That is, Figure 5 displays the total useful savings divided by the total savings. In the serial case all savings are useful, so this has the effect of normalizing the curves to intersect the ordinate axis at 1.0. Note that the curves corresponding to small values of c (i.e., small values of r) fall much more rapidly than those corresponding to large values of c. This reveals that, when the number of avoided rotations (resp., inner products) is small, increasing the number of processors severely degrades the amount of savings obtained by the modified algorithm.

The execution time of the one-sided Jacobi algorithm can be expressed as

$$T = \sum_{i=1}^{n_{sweeps}} \left[ \left( \frac{N}{p} - s_i \right) m \, t_f + 4 \left( \frac{N}{p} - c_i \right) m \, t_f + t_{comm} \right]$$

where $T$ is the total execution time, $t_f$ is the time for a single flop (i.e., multiply-add pair), $t_{comm}$ is the time for the exchange of blocks between subsweeps, $s_i$ is the number of parallel inner products saved, and $c_i$ is the number of parallel rotations saved. Both $c_i$ and $s_i$ are obtained from the Monte-Carlo simulation, given values of c and s from an actual run of the algorithm. Note that for the standard algorithm, $s_i = 0$ for all $i$.
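The execution-time expression can be evaluated directly once the per-sweep savings are known. The sketch below assumes the reading of the formula in which each sweep costs $(N/p - s_i)\,m\,t_f$ for inner products, $4(N/p - c_i)\,m\,t_f$ for rotations, and one $t_{comm}$ term; the function name `model_time` and its argument names are hypothetical.

```python
def model_time(n, m, p, s_saved, c_saved, t_f, t_comm):
    """Modelled execution time summed over sweeps.

    n, m      matrix has m rows and n columns
    p         number of processors
    s_saved   per-sweep parallel inner products saved (s_i)
    c_saved   per-sweep parallel rotations saved (c_i)
    t_f       time for one flop (multiply-add pair)
    t_comm    communication time charged per sweep
    """
    n_pairs = n * (n - 1) // 2          # N
    per_proc = n_pairs / p              # N / p
    total = 0.0
    for s_i, c_i in zip(s_saved, c_saved):
        inner = (per_proc - s_i) * m * t_f        # inner products, m each
        rot = 4 * (per_proc - c_i) * m * t_f      # rotations, about 4m each
        total += inner + rot + t_comm
    return total
```

For the standard algorithm one would pass `s_saved` as a list of zeros, since no inner products are skipped.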

Figure  6  shows  a  comparison  of  the  model  with  the 
execution  time  of  both  the  standard  and  modified  al¬ 
gorithms  for  a  problem  of  size  1600  x  256  with  various 
values  of  p.  Clearly  the  behavior  of  the  algorithms  is 
modeled  with  a  high  degree  of  accuracy. 


Figure  6:  Comparison  of  models  and  execution  times 


8  Numerical  Results 

Parallel versions of both the standard one-sided Jacobi and the pairwise decoupled algorithm were developed on the Intel iPSC/2 and later moved to the Intel iPSC/860 at Oak Ridge National Laboratory. Both machines are hypercubes; the nodes of the iPSC/2 are based on the Intel 80386 processor, while the 860 nodes are based on the Intel i860 processor. The results we report here are for the 860, since it is faster (the clock rate is 40 MHz), has more nodes, and has more memory per node than the iPSC/2, thus allowing us to solve a wider range of problems. In fact, on a matrix of size 512 × 512, 128 nodes of the 860 ran the one-sided Jacobi algorithms at over 143 Mflops. Higher rates are attainable for larger problems. Moreover, the core of the floating point calculations in the algorithm takes the form of inner products and DAXPY's (that is, y = ax + y). These have been coded in i860 assembly language to take advantage of the floating point pipelining available on the CPU, and future versions of our codes will incorporate these routines.

Table 2: Standard (s) vs. modified (m) for n = 128

                m = 400      800     1200     1600
p = 4     s       17.65    35.01    50.03    70.56
          m       14.88    29.71    43.59    59.97
          Δ%      15.7%    15.1%    12.9%    15.0%
8         s        9.65    19.67    31.18    39.31
          m        9.39    18.65    27.90    37.50
          Δ%       2.7%     5.2%    10.5%     4.6%
16        s        6.56    12.92    20.89    27.49
          m        6.43    12.52    18.76    24.98
          Δ%       2.0%     2.9%    10.2%     9.1%
32        s        5.11    10.57    14.58    19.38
          m        5.09     9.59    14.20    18.65
          Δ%       0.4%     9.3%     2.6%     3.8%
64        s        4.70     8.18    12.00    17.44
          m        5.04     8.41    12.22    17.58
          Δ%      -7.2%    -2.8%    -1.8%    -0.8%

The  only  potential  drawback  to  using  the  i860  is 
that  its  ratio  of  communication  cost  to  computation 
cost  is  significantly  higher  than  for  the  iPSC/2.  How¬ 
ever,  in  the  one-sided  Jacobi  algorithms  given  here, 
computation  dominates  communication  provided  n  > 
2p. 

Analyzing the performance data from the parallel codes is difficult since the convergence of the one-sided Jacobi algorithm depends upon the Jacobi ordering. That is, the convergence rate depends upon the order in which the pairs of columns are rotated. However, in the parallel setting, changing the number of processors used also changes the Jacobi ordering. Therefore, one must take an average over several runs for each size matrix and number of processors to obtain meaningful results.

Tables 2 and 3 show the execution time of both the standard and modified algorithms with n = 128 and n = 256 (respectively) for various values of m and p. The percentage by which the modified algorithm is faster than the standard algorithm is listed beside the times. Notice that for a large number of processors, the standard algorithm is actually faster for some problems. This agrees with the analysis in the last section, since the potential savings are too small to offset the overhead of maintaining the extra data structure. However, for a small number of processors, the savings in execution time can reach as high as 20%.
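The tabulated percentages appear to be measured relative to the standard algorithm's time, that is, 100 (s - m) / s. This is inferred from the table entries rather than stated in the text. A minimal sketch, with the hypothetical helper name `percent_faster`:

```c
/* Percentage by which the modified run is faster than the standard run,
 * measured relative to the standard time: 100 * (s - m) / s.
 * A negative result means the modified algorithm was slower.
 * (Assumed formula, inferred from the tabulated values.) */
double percent_faster(double t_standard, double t_modified)
{
    return 100.0 * (t_standard - t_modified) / t_standard;
}
```

For example, the n = 128, p = 4, m = 400 entry gives `percent_faster(17.65, 14.88)`, which is about 15.7, and the p = 64 entry `percent_faster(4.70, 5.04)` is negative.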


Table 3. Standard (s) vs. modified (m) for n = 256

   p           m = 400      800     1200     1600
   4    s        71.87   135.18   202.13   272.08
        m        57.63   114.48   171.29   229.04
        Δ%       19.8%    15.3%    15.3%    15.8%
   8    s        39.38    74.16   110.95   156.74
        m        33.57    66.18    98.74   132.37
        Δ%       14.8%    10.8%    11.0%    15.6%
  16    s        22.60    44.28    66.55    87.66
        m        20.11    39.49    58.65    78.78
        Δ%       11.0%    10.8%    11.9%    10.1%
  32    s  [illegible]    29.26    43.66    57.66
        m        14.08    27.12    39.90    52.43
        Δ%        4.8%     7.3%     8.6%     9.1%
  64    s        11.45    21.68    32.23    42.50
        m        12.00    21.89    32.15    38.35
        Δ%       -4.8%    -1.0%     0.2%     9.8%
 128    s         9.58    19.75    26.59    35.34
        m        10.46    21.34    27.52    36.03
        Δ%       -9.2%    -8.1%    -3.5%    -2.0%


9  Conclusions 

We have shown that the standard cyclic one-sided Jacobi algorithm for the computation of the SVD of a rectangular matrix contains a significant amount of unnecessary computation. This computation takes the form of inner products of column vectors known to be orthogonal. We have shown that a simple modification of the standard algorithm to incorporate a data structure that monitors such orthogonality (which we call "pairwise decoupling") can yield a reduction in the total execution time, despite the overhead of maintaining and updating the data structure.
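As an illustration only, the bookkeeping behind pairwise decoupling could look like the sketch below. The type `decouple_t`, the fixed size `NCOLS`, and all function names are hypothetical; the scheme shown (a modification counter per column plus a record of when each pair last met) is one plausible way to decide whether a pair is still known to be orthogonal, not necessarily the data structure used in our codes.

```c
#define NCOLS 4  /* hypothetical small column count for illustration */

typedef struct {
    long clock;                 /* number of rotations applied so far */
    long modified_at[NCOLS];    /* clock value when each column last changed */
    long met_at[NCOLS][NCOLS];  /* clock value when a pair last met; -1 = never */
} decouple_t;

void decouple_init(decouple_t *d)
{
    d->clock = 0;
    for (int i = 0; i < NCOLS; i++) {
        d->modified_at[i] = 0;
        for (int j = 0; j < NCOLS; j++)
            d->met_at[i][j] = -1;
    }
}

/* Nonzero if neither column has changed since the pair's previous
 * encounter, so the inner product can be skipped. */
int decouple_known_orthogonal(const decouple_t *d, int i, int j)
{
    long t = d->met_at[i][j];
    return t >= 0 && d->modified_at[i] <= t && d->modified_at[j] <= t;
}

/* Record that pair (i, j) was just processed; pass rotated != 0 if a
 * rotation actually changed the two columns. */
void decouple_record(decouple_t *d, int i, int j, int rotated)
{
    if (rotated) {
        d->clock++;
        d->modified_at[i] = d->clock;
        d->modified_at[j] = d->clock;
    }
    d->met_at[i][j] = d->met_at[j][i] = d->clock;
}
```

A sweep would call `decouple_known_orthogonal` before forming each inner product and `decouple_record` after each pair is processed; the cost of these updates is the overhead referred to above.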

The pairwise decoupled algorithm was implemented on a distributed-memory parallel computer, the Intel iPSC/860. The amount of reduction in the execution time over the standard algorithm on the 860 ranged up to 20% of the total parallel execution time. However, many of the problems exhibited less improvement than a simple calculation would indicate. This discrepancy was explained by a statistical model of the execution time of the parallel algorithm, using a Monte-Carlo simulation. Results of the simulation agreed remarkably well with the empirical results obtained from the 860.

A further modification of the Jacobi algorithm was proposed, based upon "subspace decoupling." Subspace decoupling also involves a data structure to monitor the orthogonality of pairs of columns. The difference is that in the pairwise decoupled algorithm, when a pair of columns is encountered, if either column has been modified since their previous encounter, they are assumed non-orthogonal. In the subspace decoupled algorithm, if one column of a pair to be rotated has been modified only by columns known to lie in the orthogonal complement of the other column, they are still presumed orthogonal. The potential savings in execution time for the subspace decoupled algorithm is greater than that for the pairwise decoupled algorithm; however, the communication overhead required for its implementation in a distributed-memory environment is prohibitive. We are continuing to investigate the implementation of subspace decoupling on shared-memory machines.

Abstract

Parallel systems are in general complicated to utilize efficiently. As they evolve in complexity, it becomes increasingly important to provide libraries and language features that can spare the users from the knowledge of low-level system details. Our effort in this direction is to develop a set of basic matrix algorithms for distributed memory systems such as the hypercube.

The goal is to be able to provide for distributed memory systems an environment similar to that which the Level 3 Basic Linear Algebra Subprograms (BLAS3) provide for the sequential and shared memory environments. These subprograms facilitate the development of efficient and portable algorithms that are rich in matrix-matrix multiplication, on which major software efforts such as LAPACK have been built.

To demonstrate the concept, some of these Level-3 algorithms are being developed on the Intel iPSC/2 hypercube. Central to this effort is the General Matrix-Matrix Multiplication routine PGEMM. Symmetric and triangular multiplications, rank-2k updates (symmetric case), and the solution of triangular systems with multiple right-hand sides are also discussed.


1 Introduction

The goal of this work is to provide a set of basic "universal" matrix subprograms for the distributed memory environment that would allow programmers to implement algorithms rich in matrix-matrix operations in terms of these basic subprograms. Local communication primitives could hence be hidden in the low-level routines, and the new high-level routines not only become easier to implement, but also become portable. This has previously been done with success for serial and vector machines through the Basic Linear Algebra Subprograms (BLAS) [4,3], upon which [6], among others, is based.

The high-level algorithms may not provide optimum performance measures, but our goal is to trade, say, 5-10% performance for ease of implementation and portability. Previous efforts in the same spirit include the hypercube library developed at Chr. Michelsen in Norway [2] and SCHEDULE, a parallel programming environment for FORTRAN developed at Argonne [7].

To adhere to a familiar standard, we will attempt to follow the Level-3 BLAS (BLAS3) [3] calling sequences as closely as feasible for our distributed memory case. Section 2 describes the BLAS in more detail, whereas the additional parameters needed in the distributed memory setting follow in Section 3. The core routine, general matrix-matrix multiplication, is described in Section 4. Section 5 discusses the other BLAS routines: rank-2k updates (symmetric case), triangular multiplication, and the solution of triangular systems with multiple right-hand sides. Future work and some of the issues related to the iPSC/2 implementation are mentioned in Section 6. Finally, a summary is given in Section 7.


2  The  BLAS 

The advantages of defining a set of basic linear algebra routines on which higher-level linear algebra algorithms can be built were originally discussed by Hanson et al. in 1979 [12]. The subprograms have since evolved through joint efforts by Dongarra et al. [5]. The original routines (now dubbed Level-1 routines) limited themselves to vector-vector operations, whereas the Level-2 routines [4] handle vector-matrix operations, and the Level-3 routines [3] explore matrix-matrix operations.

With their low number of data touches (and hence less communication needed) compared with the number of arithmetic operations, the problems the BLAS3 cover prove very suitable for distributed memory computers. To follow up on this familiar standard from the sequential and shared-memory world, we have decided to follow the BLAS3 conventions for calling parameters wherever possible.

For example, the GEneral Matrix-Matrix multiplication routine in BLAS3 has the following calling format:

GEMM(TRANSA, TRANSB, M, N, K, $\alpha$, A, LDA, B, LDB, $\beta$, C, LDC),

where TRANSA and TRANSB describe whether A or B is transposed or not; M, N, K are the matrix dimensions; $\alpha$, $\beta$ are scalars; and LDA, LDB, LDC are the leading dimensions of A, B, C, respectively. The additional parameters needed in the distributed setting are appended to the BLAS3 calling sequences.
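To make the meaning of these parameters concrete, the following is a minimal serial reference sketch of the GEMM operation in C, assuming column-major (Fortran-style) storage with 0-based indices. The name `ref_gemm` and the macro `ELEM` are illustrative and not part of any BLAS library; only the 'N' and 'T' options are handled.

```c
#include <stddef.h>

/* Element (i, j), 0-based, of a column-major matrix with leading dimension ld */
#define ELEM(X, ld, i, j) ((X)[(size_t)(j) * (ld) + (i)])

/* C <- alpha * op(A) * op(B) + beta * C, where op(X) is X or X^T.
 * op(A) is m-by-k, op(B) is k-by-n, C is m-by-n.
 * transa / transb are 'N' (no transpose) or 'T' (transpose). */
void ref_gemm(char transa, char transb, int m, int n, int k,
              double alpha, const double *a, int lda,
              const double *b, int ldb,
              double beta, double *c, int ldc)
{
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            double sum = 0.0;
            for (int l = 0; l < k; l++) {
                double ail = (transa == 'T') ? ELEM(a, lda, l, i)
                                             : ELEM(a, lda, i, l);
                double blj = (transb == 'T') ? ELEM(b, ldb, j, l)
                                             : ELEM(b, ldb, l, j);
                sum += ail * blj;
            }
            ELEM(c, ldc, i, j) = alpha * sum + beta * ELEM(c, ldc, i, j);
        }
    }
}
```

The leading dimension is the stride between consecutive columns in memory, which is what allows a routine to operate on a submatrix of a larger array without copying it.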

3  Data  Distribution  and  Other 
Calling  Parameters 

In the distributed memory case, extra parameters beyond the ones provided in the BLAS are needed for specifying items such as the topology of the network assumed, the data distribution desired, and possibly also parameters for indexing subclusters of processors. These parameters open up endless choices. We will, however, restrict ourselves to some of the most fundamental and useful ones. Many more options may be desirable, but too many choices defeat the purpose of having a few "core" routines that manufacturers may be willing to supply. It is the hope that sometime in the future the ideas behind these routines not only provide a standard for parallel library builders, but that optimized routines also become standard parts of future languages or operating system kernels.

Our data distribution choices are: block-submatrix, block-vector, and wrap-block-vector. Block-submatrix distributions facilitate orthogonal tree structures [9,8], which may be introduced to minimize communication costs compared to the more conventional distributed hypercube algorithms [16,13]. The orthogonal structures also make virtual transposes feasible.

The block-vector structure also maps well to hypercubes and meshes (through ring structures) and is the most common distribution of data in numerical problems. Since the individual vectors remain undistributed, it is easier to keep track of the data when doing vector-oriented operations.

Finally, the wrap-block-vector mapping is considered since it provides superior load balancing for several numerical algorithms [10,14]. Like the standard block-vector approach, it is implemented using a ring structure. The extra parallel parameters (input-distr, output-distr, and network) will be added to the end of the parameter list, and the routines renamed with a P for Parallel in front of the BLAS3 name (e.g. PGEMM, for standard general matrix-matrix multiplication).
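The difference between the block-vector and wrap-block-vector distributions can be sketched by the mapping from a column index to its owning processor. The function names below are hypothetical, indices are 0-based, and the block version assumes that the number of processors divides the number of columns.

```c
/* Block-vector: n columns are split into p contiguous blocks of n / p
 * columns each; returns the processor owning column j.
 * Assumes p divides n. */
int block_owner(int j, int n, int p)
{
    return j / (n / p);
}

/* Wrap-block-vector: blocks of w consecutive columns are dealt out to
 * the p processors in round-robin fashion. */
int wrap_block_owner(int j, int w, int p)
{
    return (j / w) % p;
}
```

With the wrapped mapping every processor keeps some of the trailing columns, so in a factorization where leading columns are retired as the computation proceeds, the remaining work stays spread over all processors; with the contiguous mapping, processors holding only early columns run out of work.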

The most common and useful network topologies include hypercubes, grids (including torus), rings, and trees. This list may, however, be extended as novel architectures take on other topologies. This parameter is, perhaps, the only one that has to be modified when porting code between different architectures. Efficiency of the code will, however, be somewhat linked to the data structures (though the communication bandwidth of the system is probably of more importance). For instance, true ring topologies do not emulate grid structures, broadcast, and gather as efficiently as, say, hypercubes.

Useful communication structures include rings, trees, and meshes. Rings are commonly used in numerical algorithms where operations are performed on block-vectors. They may be embedded on a hypercube network using all nodes by numbering the processors according to 1-D Gray codes [17,1,9]. The Gray codes ensure that processors that are next to each other in the ring structure also achieve near-neighbor communication between physical hypercube nodes. This embedding also includes a spanning tree (Figure 1).
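The ring embedding can be sketched in a few lines of C. The binary reflected Gray code of position k is k XOR (k >> 1); consecutive ring positions then map to hypercube node numbers that differ in exactly one bit, i.e. to physical neighbors. The function names are illustrative.

```c
/* Binary reflected Gray code of k */
unsigned gray(unsigned k)
{
    return k ^ (k >> 1);
}

/* Hypercube node number holding ring position pos in a d-dimensional
 * cube (2^d nodes); positions wrap around the ring. */
unsigned ring_node(unsigned pos, unsigned d)
{
    return gray(pos % (1u << d));
}

/* Number of set bits in x; applied to a XOR b it gives the hypercube
 * distance between nodes a and b. */
unsigned bit_count(unsigned x)
{
    unsigned c = 0;
    while (x) {
        c += x & 1u;
        x >>= 1;
    }
    return c;
}
```

For a 3-dimensional cube the ring visits nodes 0, 1, 3, 2, 6, 7, 5, 4 and then returns to 0, each step crossing a single hypercube link.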

Meshes (including toroidal connections) may similarly be embedded on hypercubes using 2-D Gray codes. This embedding includes a set of orthogonal tree structures [9]. Whereas tree structures provide efficient structures for broadcast and gather (both processor-row/column-wise and network-wide), grids, perhaps the most common parallel network, are well-suited for block-submatrix data distributions, which are common in applications such as image and seismic processing.


Figure 1. Ring embedding on a hypercube using the binary reflective Gray code. Tree structure for broadcast/gather also shown.

4 Parallel Matrix-Matrix Multiplication

The general matrix-matrix multiplication routines (GEMMs) are the core of the BLAS-3 library. For real matrices A, B, and C ($\alpha$ and $\beta$ are scalars), the operations can be described as follows:

$C \leftarrow \alpha AB + \beta C$, where A and/or B may be transposed.

The scalar multiplication ($\alpha$, $\beta$) may simply be performed by broadcasting the scalar value(s) to each node and then performing the scaling locally. We shall, however, assume that the scalars $\alpha = \beta = 1$ in our discussion for simplicity. Their computation will also not be affected by the data distribution (block-column or submatrix-block). A discussion of the matrix product AB follows. Note that the matrix addition included in the matrix operations above is performed along with the multiplication, as the result is added in during the accumulation of the summation $\sum_k a_{ik} b_{kj}$.

Of the four BLAS permutations allowed through the TRANSA and TRANSB options (see last section), we will first take a closer look at the AB case.


On a torus, computing the products $a_{ik} b_{kj}$ using block-submatrix distributions can be achieved by rotating the distributed B matrix east-west through the processor plane as the appropriate data reaches the processors. Orthogonal structures may then be used to gather the summations. These structures will also be used for the $A^T B$ and $A B^T$ cases.

Similarly, in a ring setting, whole block-vectors are rotated left-right on a ring instead of the submatrices for a mesh in the summation phase. Notice that the ring structure mapping also provides a binary tree structure when implemented using the binary reflective Gray code. This allows for efficient broadcasts and gatherings of data.

For $A^T B$ and $A B^T$, which matrix (A or B) to rotate through the processors in order to avoid stride problems depends on the storage convention of the matrices. Finally, in the $A^T B^T$ case, the data needs to be "transposed" in order to compute the products $a_{ik} b_{kj}$. The data may also need to be reordered locally to avoid stride problems.

If one of the matrices A or B is symmetric, then either $A = A^T$ or $B = B^T$. These cases can hence be viewed as the $A^T B$ and $A B^T$ cases described above. We are here assuming that compressing the storage of symmetric matrices is not worthwhile in the distributed memory setting. Although uncompressed storage is more costly, the increased algorithmic complexity of compressed storage seems to outweigh its benefits. Also, in the block-submatrix case one would not be able to take full advantage of the orthogonality of the hypercube structure if storage compression is used.


5  Other  BLAS  Routines 

The following is a brief discussion of how the distributed memory setting will affect the rest of the BLAS routines.


5.1 Rank-2k Updates of a Symmetric Matrix

We here consider the following updates of a symmetric matrix C covered by the BLAS3:

$C \leftarrow \alpha A B^T + \alpha B A^T + \beta C$

$C \leftarrow \alpha A^T B + \alpha B^T A + \beta C$

The rank-k cases are covered by the general matrix-matrix multiplication routines. In the case of the rank-2k updates of symmetric matrices, all matrix-matrix products are of the form $A^T B$ or $A B^T$, which, as mentioned, do not require transpositions. Notice that since $A B^T = (B A^T)^T$ and $A^T B = (B^T A)^T$, only one of the products needs to be computed, and the remainder of the computation reduces to matrix additions (with scaling by $\alpha$ and $\beta$).
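The saving from computing only one product can be sketched serially as follows. The sketch assumes column-major storage with 0-based indices, A and B of size n-by-k, and full (uncompressed) storage of C; the name `ref_syr2k` is illustrative and is not the BLAS3 routine itself.

```c
#include <stddef.h>
#include <stdlib.h>

/* C <- alpha * (A * B^T + B * A^T) + beta * C, with A and B n-by-k and
 * C n-by-n. Only P = A * B^T is formed; B * A^T is its transpose.
 * Returns 0 on success, -1 if the workspace cannot be allocated. */
int ref_syr2k(int n, int k, double alpha,
              const double *a, int lda,
              const double *b, int ldb,
              double beta, double *c, int ldc)
{
    if (n <= 0)
        return 0;
    double *p = malloc((size_t)n * (size_t)n * sizeof *p);
    if (p == NULL)
        return -1;

    /* P(i, j) = sum over l of A(i, l) * B(j, l), computed once */
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int l = 0; l < k; l++)
                sum += a[(size_t)l * lda + i] * b[(size_t)l * ldb + j];
            p[(size_t)j * n + i] = sum;
        }

    /* The rest is matrix addition with scaling: P + P^T */
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            c[(size_t)j * ldc + i] =
                alpha * (p[(size_t)j * n + i] + p[(size_t)i * n + j])
                + beta * c[(size_t)j * ldc + i];

    free(p);
    return 0;
}
```

In a distributed setting the transpose of P implies data movement between processors, so the trade-off between recomputation and communication would have to be measured on the target machine.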

5.2  Triangular  Matrix  Multiplication 

We here consider permutations of multiplying the dense matrix B with a triangular matrix T:

$B \leftarrow \alpha T B$, where T and/or B may be transposed.

Notice that if one considers T upper triangular, then the $T^T$ case would represent the lower triangular case and vice versa.

Triangular matrix multiplication may easily be performed by redistributing only data from T. In the ring/block-vector case, the first two multiplications ($TB$ and $T^T B$) may be computed rather straightforwardly (with respect to communication) since the B matrix is already distributed in the same block-column fashion as used for the general multiplication case. For $BT$ and $BT^T$, however, the matrix B is distributed in a block-column fashion whereas the general method assumes a row-wise access. In this case, redistributing T would hence not be sufficient.

5.3  Triangular  Systems  with  Multiple 
Right-Hand  Sides 

In this section, orthogonal data structures are introduced in the context of solving some basic linear systems. First, triangular systems with multiple right-hand sides will be considered:

$B \leftarrow \alpha T^{-1} B$
$B \leftarrow \alpha T^{-T} B$
$B \leftarrow \alpha B T^{-1}$
$B \leftarrow \alpha B T^{-T}$

Here $\alpha$ is a scalar, $B \in \mathbb{R}^{m \times n}$, and T is a non-singular triangular matrix. Notice how both the inverse ($T^{-1}$) and inverse transpose ($T^{-T}$) cases are considered for T, providing both the upper and the lower triangular cases.

Since triangular solves involve either forward or backward substitution (both inherently sequential operations), parallelization is not as straightforward as in the multiplication cases. However, decent parallelization can be achieved by using a "pipelined" approach, as described in [14], where the data is mapped to a ring structure. In this case, a wrap-block-vector data distribution is used since it provides better processor utilization in the factorization stage [10].
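For reference, the serial computation being parallelized can be sketched as forward substitution applied to each right-hand side in turn. The sketch below covers only the lower triangular, left-side case $B \leftarrow \alpha T^{-1} B$ with column-major storage and 0-based indices; the name `ref_trsm_lower` is illustrative.

```c
#include <stddef.h>

/* B <- alpha * T^{-1} * B for a nonsingular lower triangular n-by-n
 * matrix T and an n-by-nrhs matrix B, overwriting B with the solution. */
void ref_trsm_lower(int n, int nrhs, double alpha,
                    const double *t, int ldt,
                    double *b, int ldb)
{
    for (int j = 0; j < nrhs; j++) {
        double *x = b + (size_t)j * ldb;   /* j-th right-hand side */
        for (int i = 0; i < n; i++) {
            /* x[0..i-1] already hold solution entries */
            double s = alpha * x[i];
            for (int l = 0; l < i; l++)
                s -= t[(size_t)l * ldt + i] * x[l];
            x[i] = s / t[(size_t)i * ldt + i];
        }
    }
}
```

Each entry of a solution column depends on all earlier entries of that column, which is the sequential dependence noted above; the right-hand sides themselves are independent of one another.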

It was recently shown that these other BLAS-3 subprograms can actually be implemented in terms of GEMM with reasonable efficiency, at least in the sequential setting [15]. This would be desirable in the parallel setting as well, since it would reduce the machine-dependent encodings to that of PGEMM. Further investigation of this idea is currently under consideration.

6  Some  Implementation  Issues 
and  Future  Work 

The Intel iPSC/2 hypercube is currently being used as a test-bed for implementing the PBLAS routines. The ideas behind the routines are not meant to limit themselves to the Intel cube or its topology; rather, the Intel machine is used as a test environment for how the PBLAS may be developed for common distributed memory systems. As mentioned, it is our hope that the PBLAS can become a standard for how core matrix algorithms are developed.

We chose to implement our routines in C. Fortran may seem, to many, the most natural language in which to implement matrix algorithms. However, C, with its powerful pointer constructs for dynamic memory allocation and strong link to UNIX, is rapidly becoming more popular. Since we wanted to use some of the C pointer features in the implementation, it became a natural choice. C also interfaces well with Fortran and is, along with Fortran 77, available on the Intel hypercube. (Fortran 90 may provide similar features, but is not yet available for the Intel cube.)

At present, PGEMM has been partially implemented. Current work includes completing most of the PGEMM cases (transpose of A and B, various data distributions, etc.; see Section 3), benchmarking it, and developing some examples of application. Future work includes the implementation of some of the other PBLAS routines (quite probably with calls to PGEMM). Since our goal is to demonstrate the concepts rather than provide production code, we will for now limit ourselves to a couple of core cases rather than provide a full PBLAS implementation for the Intel hypercube (left as future work for people providing production code).

6.1  PGEMM  on  the  hypercube 

Taking a closer look at the case $C \leftarrow C + A \cdot B$ in a ring setting, we decided to store the matrices on the nodes in one-dimensional arrays. These are then used directly as message buffers during the communication phases, saving valuable storage space and copy time. To emulate two-dimensional array indexing, index functions were defined that include the leading dimension of the respective matrix (LDX):

Indexing function for A, B and C:

This indexing function assumes matrices stored by column starting at array location A[1] (FORTRAN-type).

#define AINDEX(X,LDX,i,j) X[((j)-1)*(LDX) + (i)]

The block-column version of $C \leftarrow C + A \cdot B$ can be described by the following equation for block-vector $C_i$:

$C_i = A_1 B_{1i} + A_2 B_{2i} + \cdots + A_p B_{pi}$, for $i = 1 : p$,

where p is the number of block-vectors ($n = r \cdot p$, where r is the block width).

Assuming a ring embedding using the Binary Reflective Gray Code [17], the above equation leads to the algorithm:

PGEMM

Let me = position in ring
(As in the equation above, where node i holds C_i and A_i locally. Also reflects which part of the matrices are stored locally.)

Compute on local data:

C_me = C_me + A_me * B_me,me

Following the MATLAB notation as described in Golub-Van Loan [11], we here have (all local sub-blocks):

C_me = C[1:n, (me-1)r+1 : me*r]
A_me = A[1:n, (me-1)r+1 : me*r]
B_me,me = B[(me-1)r+1 : me*r, 1:r]

(numnodes = no. of nodes in cube (ring))

A_tmp = A_me

FOR p = 1 to numnodes - 1
    SEND A_tmp to left-neighb
    RCV A_tmp from right-neighb
    C_me = C_me + A_tmp * B_i,me

(i is a function of numnodes and p corresponding to the above equation)

Leave result on nodes or SEND to host.
END{PGEMM}
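The data flow of the ring algorithm can be checked with a small serial simulation. The sketch below is not the iPSC/2 code: it keeps full n-by-n column-major arrays (0-based), treats block column `me` as the data held by node `me`, and models the SEND/RCV pair simply by advancing the index of the A block that each node currently holds. It performs the initial local product followed by numnodes - 1 shifts, since one further shift would return each node's own block. The names `local_update` and `ring_gemm_sim` are illustrative.

```c
#include <stddef.h>

/* C_me <- C_me + A_blk * B_{blk,me}: multiply the n-by-r block column
 * blk of A by the r-by-r block (blk, me) of B and accumulate into
 * block column me of C. All matrices are n-by-n, column-major. */
static void local_update(int n, int r, int blk, int me,
                         const double *a, const double *b, double *c)
{
    for (int j = 0; j < r; j++)          /* column within block column me */
        for (int l = 0; l < r; l++) {    /* column of A_blk, row of B block */
            double blj = b[(size_t)(me * r + j) * n + (blk * r + l)];
            for (int i = 0; i < n; i++)
                c[(size_t)(me * r + j) * n + i] +=
                    a[(size_t)(blk * r + l) * n + i] * blj;
        }
}

/* Serial simulation of C <- C + A * B on p simulated ring nodes.
 * Assumes n = r * p. Step 0 is the purely local product; after each of
 * the following p - 1 shifts, node me holds block (me + step) mod p. */
void ring_gemm_sim(int n, int p, const double *a, const double *b, double *c)
{
    int r = n / p;
    for (int step = 0; step < p; step++)
        for (int me = 0; me < p; me++) {
            int blk = (me + step) % p;   /* A block currently held by me */
            local_update(n, r, blk, me, a, b, c);
        }
}
```

On a real machine the inner loop over `me` runs concurrently on all nodes, and each step overlaps one block-vector transfer per node with an n-by-r times r-by-r local product.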

Other cases will be described in future work, along with a discussion of how to access submatrices (leaving some processors idle rather than redistributing data).

7  Conclusions 

In this paper, a basic set of linear algebra algorithms in the spirit of BLAS [4,3] was proposed to form a basis for algorithmic development also in the distributed memory environment.

Extra parameters needed for the parallel environment were identified and added to the BLAS3 calling sequences. These parameters included a parameter to describe the network topology and parameters for specifying the input and output distribution of the data. Powerful communication structures (such as the orthogonal data structures involving trees and meshes) could then be hidden within the routines, sparing the users from hardware details.

It is the hope that sometime in the future the ideas behind these routines not only will provide a standard for parallel library builders, but that similar optimized routines also become standard parts of future languages or operating system kernels.

Abstract

A set of routines has been written for dense matrix operations optimized for the NCUBE/6400 parallel processor. This work was motivated by a Sandia effort to parallelize certain electronic structure calculations [1]. Routines are included for matrix transpose, multiply, Cholesky decomposition, triangular inversion, and Householder tridiagonalization. The library is written in C and is callable from Fortran. Matrices up to order 1600 can be handled on 128 processors. For each operation, the algorithm used is presented along with typical timings and estimates of performance. Performance for order 1600 on 128 processors varies from 42 MFLOPs (Householder tridiagonalization, triangular inverse) up to 126 MFLOPs (matrix multiply). We also present performance results for communications and basic linear algebra operations (saxpy and dot products).


Introduction. 

This paper describes the implementation of routines for dense linear algebra on the NCUBE/6400 hypercube. The primary purpose of these routines is for electronic structure calculations, and emphasis has been placed on routines for the generalized symmetric eigenvalue problem, although we intend to add routines for linear solutions and SVD.

A coarse-grain MIMD machine such as the NCUBE might not be expected to perform particularly well on linear algebra problems when compared with SIMD machines (e.g. the Connection Machine) or vector machines like the CRAY-XMP and its successors. However, electronic structure programs spend most of their time computing matrix elements, a task which is very well suited to the MIMD architecture. In order to retain this advantage, it is important to have a library of linear algebra software with good performance. This was the object of the work presented here.

*This work was performed at Sandia National Laboratories, which is operated for the U.S. Department of Energy under contract number DE-AC04-76DP00789.

The NCUBE/6400 hypercube can be configured with up to 8192 processors (nodes). Sandia currently has two of these systems, one with 8 nodes and 4MB of memory per node and the other with 128 nodes with 1MB of memory per node. The latter system is being upgraded to 1024 nodes with 4MB per node. The NCUBE/6400 uses a second generation NCUBE CPU chip that has some architectural differences from the NCUBE/ten CPU chips. The measured floating point and communications parameters of the new NCUBE are described below. We use the notation NCUBE-II to refer to this processor with the initial software release, clock cycle time (20MHz), and memory wait states. Later versions of the 'NCUBE/6400' might be released with rather different properties. We also use the notation NCUBE-I to refer to processors of the NCUBE/ten and other first generation NCUBE systems.

In the following, we present results for the basic floating point and communications performance of the NCUBE-II. We then describe the basic operations such as transpose, mapping, and matrix multiply. The performance of the Cholesky factorization and inverse of a triangular matrix are presented next, followed by the results for Householder tridiagonalization.


NCUBE-II floating point and communications parameters.

Linear algebra software relies heavily on a few simple kernels [2] and it is important to optimize these carefully for a particular machine architecture. The NCUBE-II has a simple memory hierarchy [4] with only main memory and an instruction cache. There are no vector instructions or vector registers. The C compiler is capable of using only a small number of registers for inner loops and only if explicitly assigned by the user. We therefore chose not to implement any level-2 or level-3 BLAS operations.

Table 1 shows timing results for dot product and saxpy operations. The single precision dot product routine accumulates the result in double precision, and requires 2 flops and 2 memory references per loop step, plus a single to double conversion. The saxpy operation requires 2 flops and 3 memory references per step. The saxpy operations are therefore a little faster than one might expect based on the number of memory references. The complex versions of these operations require 8 flops per step. Using these loops we can define a floating point operation time $\tau_f$ which is an average of the dot or saxpy times.

Table 1: Measured floating point times for dot products and saxpy operations on a single NCUBE-II processor. Times are in microseconds per step for a vector length of 100. The labels r4, r8 refer to single precision real, double precision real, and c4, c8 refer to single precision complex, double precision complex. Single precision is 4 byte IEEE floating point and double precision is 8 byte IEEE floating point.

  Function     Time/step   MFLOPs
  dot, r4         1.66      1.20
  dot, r8         1.74      1.15
  dot, c4         6.13      1.31
  dot, c8         5.21      1.54
  saxpy, r4       1.77      1.13
  saxpy, r8       1.66      1.20
  saxpy, c4       7.06      1.13
  saxpy, c8       6.26      1.28

The kernel routines are written in C, with several optimizations which contribute significantly to the speed. Registers are carefully assigned to loop indices and pointers. Pointer arithmetic is used rather than array references. The loops are decremented (n to 1) rather than incremented. There is some loop unrolling. These optimizations generally give a speedup of 2-3 times that of naive code. We estimate that future compiler optimizations and/or assembly coding of these routines may give improvements of 20%-50%. We note that these C routines are about 12 times faster than hand coded assembler versions written for the NCUBE-I.

We have also measured the communication parameters $\tau_0$ for
startup time and $\tau_e$ for transfer time per byte. This is done
with a program which runs on a pair of nodes which exchange a large
number of messages. Different tests varied the length $l$ of the
messages from 1 to 1000 bytes and the results were fitted to an
expression of the form $T = a + lb$. Deviations from the straight
line fit were about 5%. The measurement does not distinguish between
times for message read and message write, so we take $\tau_0 = a/2$
and $\tau_e = b/2$.

Table 2 summarizes the measured floating point $\tau_f$ and
communication parameters for the NCUBE-II. Note the rather large
value of $\tau_0$ relative to $\tau_f$ or $\tau_e$. Independent
measurements using Fortran give $\tau_0 = 200$ and $\tau_e = .4$ [5].

Table 2: NCUBE-II floating point and communications parameters. All
times are in microseconds.

Single precision flop time, $\tau_f$    .84
Double precision flop time, $\tau_f$    .77
Message startup time, $\tau_0$          151
Message time/byte, $\tau_e$             .37
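The fitting step described above can be sketched as an ordinary least-squares line fit of exchange time against message length, followed by the halving of both coefficients. The code below is an illustration only: the helper names are hypothetical, and the data are synthetic points generated from the Table 2 values rather than the measurements themselves.

```python
def fit_line(lengths, times):
    """Least-squares fit of times = a + b * lengths; returns (a, b)."""
    n = len(lengths)
    mean_l = sum(lengths) / n
    mean_t = sum(times) / n
    sxx = sum((l - mean_l) ** 2 for l in lengths)
    sxy = sum((l - mean_l) * (t - mean_t) for l, t in zip(lengths, times))
    b = sxy / sxx
    a = mean_t - b * mean_l
    return a, b


def comm_params(lengths, exchange_times):
    """Startup and per-byte times, taking half of each fitted coefficient
    because one exchange contains both a message write and a message read."""
    a, b = fit_line(lengths, exchange_times)
    return a / 2.0, b / 2.0


if __name__ == "__main__":
    # Synthetic data (an assumption for illustration), built from Table 2.
    lengths = [1, 10, 100, 500, 1000]
    times = [2 * (151.0 + 0.37 * l) for l in lengths]
    print(comm_params(lengths, times))
```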


Matrix  organization. 

All of the routines described here allocate entire columns of the
matrix to the processors. Routines are provided to map between
different matrix organizations (columns blocked or wrapped, and
either binary or Gray-code imbedding). Thus, algorithms which use
any of these organizations can be implemented.

An alternative to column organization is mesh organization, where
the matrix is imbedded in a 2D mesh of processors. Routines have
been written to map between column and mesh organizations, but none
of the algorithms presented here use mesh organizations.
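To make the four column organizations concrete, the sketch below assigns a column index to a processor for blocked or wrapped columns, with either natural (binary) or Gray-code numbering along the ring. The exact conventions used by the library are not given in the text, so the block size rule, the 0-based indexing, and the function names are all assumptions.

```python
def gray(i):
    """Binary-reflected Gray code of i; consecutive values differ in one bit."""
    return i ^ (i >> 1)


def owner(col, n, p, wrapped, gray_code):
    """Processor holding column `col` (0-based) of an n-column matrix on p nodes.

    wrapped=True deals columns out cyclically; wrapped=False gives each node
    one contiguous block. gray_code=True renumbers ring positions with a Gray
    code so that neighbouring positions are hypercube neighbours.
    """
    if wrapped:
        ring_pos = col % p
    else:
        block = -(-n // p)  # ceiling of n / p
        ring_pos = col // block
    return gray(ring_pos) if gray_code else ring_pos


if __name__ == "__main__":
    for name, w, g in [("N,B", False, False), ("N,W", True, False),
                       ("G,B", False, True), ("G,W", True, True)]:
        print(name, [owner(c, 16, 4, w, g) for c in range(16)])
```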


Basic  matrix  routines. 

Basic routines include matrix setup, mapping routines, matrix
transpose, matrix norms, and user functions (i.e., apply a user
defined function to all elements of the matrix). Timings for the
mapping routines are shown in Table 3 for a matrix of order 1500 on
64 processors. These routines have a typical complexity of
$n^2 \log P / P$ where $n$ is the matrix order and $P$ is the number
of processors [6].

The matrix transpose has a complexity similar to the above routines.
Times are .48 seconds for n = 1024, P = 64 and 1.04 seconds for
n = 1500, P = 64.


Matrix  multiply. 

Matrix multiply is implemented as a matrix transpose followed by
column-wise dot products. The multiply algorithm uses a (Gray, block)
organization and calls the mapping routine if necessary. The left
hand factor is transposed and cycled around the imbedded ring of
processors. Table 4 shows timings and MFLOP rates for P = 64 and
P = 128 for various matrix sizes.

Table 3: Timings in seconds for mapping routines. N refers to natural
or binary processor ordering, G refers to Gray-code processor
ordering, B refers to columns blocked, and W refers to columns
wrapped. Timings are for 64 processors and matrix size 1500 by 1500.

         N, B    N, W    G, B    G, W
N, B     -       .42     .041    .8
N, W     .42     -       .78     .001
G, B     .041    .78     -       1.2
G, W     .8      .001    1.2     -

Table 4: Timings in seconds for matrix multiply. MFLOP rates follow
in parentheses and are computed from the serial complexity of the
algorithm, $2n^3$ flops.

N       P=64          P=128
128     .15 (27)      .11 (36)
256     .72 (44)      .45 (71)
512     4.70 (57)     2.63 (102)
1024    32.9 (65)     17.37 (123)
1300    68.0 (65)     37.42 (117)
1600    -             64.8 (126)
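The transpose-and-cycle scheme can be illustrated with a serial simulation. In this sketch each of `p` simulated nodes keeps a fixed block of columns of the right hand factor, while blocks of rows of the left hand factor (the columns of its transpose) move one position around the ring per step; every entry of the product is then a plain dot product. The block distribution, the requirement that `p` divides `n`, and all names are assumptions made for illustration, not the library's interface.

```python
def ring_matmul(A, B, p):
    """Simulate C = A * B on a ring of p nodes with block column ownership."""
    n = len(A)
    assert n % p == 0, "this sketch assumes p divides n"
    blk = n // p
    # Columns of B owned by each node; these never move.
    b_cols = [[[B[r][c] for r in range(n)] for c in range(q * blk, (q + 1) * blk)]
              for q in range(p)]
    # Columns of A^T (rows of A), tagged with their row index; these circulate.
    at_cols = [[(i, A[i]) for i in range(q * blk, (q + 1) * blk)]
               for q in range(p)]
    C = [[0.0] * n for _ in range(n)]
    for _ in range(p):
        for q in range(p):  # work done "in parallel" by node q
            for i, a_row in at_cols[q]:
                for local, b_col in enumerate(b_cols[q]):
                    j = q * blk + local
                    C[i][j] = sum(x * y for x, y in zip(a_row, b_col))
        # Pass each block of A^T to the next node on the ring.
        at_cols = at_cols[-1:] + at_cols[:-1]
    return C


if __name__ == "__main__":
    print(ring_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2))
```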

Cholesky  factorization  and  triangular 
inverse. 

The Cholesky factorization algorithm is similar to that of Geist,
et al. [8]. The algorithm implemented here computes the factorization
$A = U^T U$ where $U$ is upper triangular; it is therefore a column
version of Geist's row algorithm. The (Gray, wrap) mapping is used.

In contrast to Cholesky factorization, the algorithm to compute the
inverse of a triangular matrix uses a (Gray, block) mapping. The wrap
version of this algorithm incurs very large communication costs. Even
the block version must be modified to prevent the message traffic
along the imbedded ring from clogging up. The NCUBE software provides
buffered asynchronous communications, and the triangular inverse
algorithm begins at the last column by computing a small amount and
then sending a large column to the left. Therefore, without
synchronization, the buffer memory of a node down the ring quickly
fills up and the software does not recover from this state. A simple
fix is to synchronize each stage of the ring after some number of
bytes (chosen to be a fraction of the available buffer memory) have
been passed.

Table 5 summarizes the timing results for Cholesky factorization and
triangular inverse.

Table 5: Timings in seconds for Cholesky factorization and triangular
inverse. MFLOPs follow in parentheses and are computed from the
serial complexity of the algorithms, $n^3/3$ in both cases.

Cholesky factorization.

N       P=64          P=128
128     .15 (4.7)     .14 (5.0)
256     .46 (12.)     .40 (14.)
512     1.87 (24)     1.45 (30)
1024    9.38 (38)     6.45 (56)
1300    17.2 (43)     11.4 (64.2)
1600    -             18.7 (73)

Triangular inverse.

N       P=64          P=128
128     .10 (7.0)     .09 (7.8)
256     .41 (14)      .27 (21)
512     2.37 (19)     1.38 (32)
1024    16.25 (22)    8.81 (41)
1300    33.4 (22)     18.5 (39.5)
1600    -             32.4 (42)
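For reference, a serial column-oriented computation of the factor $U$ in $A = U^T U$ can be sketched as follows. This only illustrates the factorization itself; it says nothing about the parallel (Gray, wrap) implementation, and the function name is hypothetical.

```python
import math


def cholesky_upper(A):
    """Return upper triangular U with A = U^T U, built one column at a time."""
    n = len(A)
    U = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j + 1):
            # Remove the contribution of rows above i from A[i][j].
            s = A[i][j] - sum(U[k][i] * U[k][j] for k in range(i))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                U[j][j] = math.sqrt(s)
            else:
                U[i][j] = s / U[i][i]
    return U


if __name__ == "__main__":
    print(cholesky_upper([[4.0, 2.0], [2.0, 5.0]]))
```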

Householder  tridiagonalization. 

Unlike Householder orthogonalization, Householder tridiagonalization
cannot be pipelined [7]. This forces the use of broadcast
communications. Moreover, there is a significant serial component to
the algorithm. The algorithm proceeds as follows. The matrix is
distributed using the (natural, block) mapping. For $c = 1$ to
$n - 2$ we then execute the following:

1. On the processor containing column $c$, compute the Householder
   vector $u$. This requires about $3i$ flops, where $i = n - c - 1$,
   so the time for step 1 is $T_1 = \frac{3}{2} n^2 \tau_f$. This is
   part of the serial overhead.

2. Broadcast $u$ to all processors. This takes time
   $2 \log(P) (\tau_0 + 4i\tau_e)$ for each step, so
   $T_2 = 2n \log(P) (\tau_0 + 2n\tau_e)$.

3. Compute $w = Au$ in parallel. Each processor computes its share of
   $w$ using its share of the columns of $A$. The total time for this
   step is $T_3 = \frac{2}{3} n^3 \tau_f / P$.

4. Compute the inner products of $u$ with $u$ and $u$ with $w$. The
   calculation time for this is small, but the results must be
   exchanged among all processors. The time is therefore
   $T_4 = 2n \log(P) \tau_0$.

5. Combine $q = \alpha u + \beta w$ where $\alpha$ and $\beta$ are
   computed from the inner products above. The calculation is done in
   parallel and is small, but the resulting $q$ must be combined on
   all nodes. The time for this is
   $T_5 = 2n \log(P) (\tau_0 + 2n\tau_e)$.

6. Compute $A^{(c+1)} = A^{(c)} + u q^T + q u^T$ in parallel. The time
   for this is $T_6 = \frac{4}{3} n^3 \tau_f / P$ if no advantage is
   taken of symmetry.

Summing the contributions for $n - 2$ columns, the total time is

$$T = (2\tau_f / P)\, n^3 + \left(\tfrac{3}{2}\tau_f + 8 \log(P)\, \tau_e\right) n^2 + \left(6 \log(P)\, \tau_0\right) n$$

for single precision floating point, where we have kept only the
leading terms. Using this expression and the parameters from Table 2,
we obtain 11.36 seconds for n = 512, P = 64 and 54.6 seconds for
n = 1024, P = 64. The actual results are shown in Table 6; agreement
is good. The biggest part of the overhead is due to communications.
A fairly simple modification of the above algorithm reduces the terms
proportional to $\log(P)$ by a factor of 2. This is done by running
the algorithm backwards from $c = n$ to 2 and collapsing the
broadcast and combine operations to smaller subcubes as the
calculation progresses. This modification has been tested on the
NCUBE-I but has not yet been implemented on the NCUBE-II.
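The leading-order timing model for the tridiagonalization can be evaluated directly. The sketch below assumes the form $T = (2\tau_f/P) n^3 + (\frac{3}{2}\tau_f + 8\log(P)\tau_e) n^2 + 6\log(P)\tau_0 n$, reads $\log$ as the base-2 logarithm (the hypercube dimension), and uses the single precision parameters of Table 2; the coefficient $\frac{3}{2}$ and the logarithm base are assumptions where the printed formula is hard to read.

```python
import math


def tridiag_time(n, p, tau_f=0.84, tau_e=0.37, tau_0=151.0):
    """Modelled tridiagonalization time in seconds (parameters in microseconds)."""
    log_p = math.log2(p)  # assumed: log means the hypercube dimension
    micro = ((2.0 * tau_f / p) * n ** 3
             + (1.5 * tau_f + 8.0 * log_p * tau_e) * n ** 2
             + 6.0 * log_p * tau_0 * n)
    return micro * 1e-6


if __name__ == "__main__":
    for n in (512, 1024):
        print(n, round(tridiag_time(n, 64), 2))
```

Under these assumptions the model gives roughly 11.3 and 53.7 seconds for n = 512 and n = 1024 on 64 processors, close to the 11.36 and 54.6 seconds quoted in the text; the small difference may come from lower-order terms that this sketch drops.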

Table 6: Timings in seconds for Householder tridiagonalization.
MFLOPs follow in parentheses and are computed from the serial
complexity of the algorithm, $\frac{4}{3} n^3$ flops.

N       P=64          P=128
128     .414 (6.7)    1.06 (2.6)
256     2.88 (7.3)    2.80 (7.5)
512     11.47 (16)    9.28 (19)
1024    61.8 (23)     41. (35)
1600    -             128. (43)

Conclusions.

This paper has presented algorithms and performance for a library of
dense matrix linear algebra routines. The library is optimized for
the NCUBE/6400, a new-generation MIMD hypercube. Performance for dot
products and saxpy operations varies from 1.1 to 1.5 MFLOPs on a
single node. The results for up to 128 processors are encouraging:
for n = 1600 we get performance varying from 42 MFLOPs (Householder
tridiagonalization, triangular inverse) up to 126 MFLOPs (matrix
multiply). These results are encouraging for our electronic structure
calculations.

Abstract

There is often a trade-off between preserving sparsity and numerical
stability in sparse matrix factorizations. In applications like the
direct solution of the Equality Constrained Least Squares problem,
the accurate detection of the rank of a large and sparse constraint
matrix is a key issue. Column pivoting is not suitable for
distributed memory machines because it forces the program into a
lock-step mode, preventing any overlapping of computations. So
factorization algorithms on such machines need to use a reliable, yet
inexpensive, incremental condition estimator to decide which columns
to include. We describe an incremental condition estimator that can
be used during a sparse QR factorization. We show that it is quite
reliable and is well suited for use on parallel machines. We supply
experimental results to support its effectiveness as well as its
suitability for parallel architectures.

1.  Introduction 

Choosing a set of linearly independent columns from a given matrix,
within a tolerance of machine precision, is a common subproblem in
matrix computations. Subtle variations of the same problem are "rank
detection" and "condition estimation".

Traditional methods of rank detection for dense matrices include QR
factorization with column pivoting [9], Singular Value Decomposition
[9] and a host of condition estimators [11], the most popular among
them being the LINPACK 1-norm estimator. A threshold pivoting
strategy [10] is often used in the case of sparse matrices.

*Research supported by the National Science Foundation under grant
no. CCR-8700172, the Air Force Office of Scientific Research under
grant no. AFOSR-88-0161, and the Office of Naval Research under grant
no. N00014-80-0517.

However, when we consider solving these problems on parallel
architectures, most of the traditional approaches fail to be cost
effective, especially when large and sparse matrices are involved.

In this paper, we propose an incremental condition estimator which is
quite reliable and is well suited for parallel sparse matrix QR
factorizations. In section 2, we examine the reasons for the failure
of traditional methods when applied to our problem. In section 3, we
discuss the issues in solving our problem effectively on a parallel
architecture. In section 4, we describe an algorithm that allows us
to incrementally estimate the condition number of the triangular
factor during the factorization. In section 5, we discuss the
implementation issues on a parallel architecture and provide
experimental results. In section 6, we provide experimental evidence
that suggests that the algorithm is robust.

2.  Failure  of  Traditional  Methods 

The general strategy for doing a QR factorization of a sparse matrix
$C$ is [6]

1. Determine the symbolic structure of $C^T C$.

2. Using a heuristic approach, find a permutation matrix $P$ such
   that $P^T C^T C P$ has a sparse Cholesky factor.

3. Generate the storage structure for $R$ by doing a symbolic
   factorization of $P^T C^T C P$.

4. Compute $R$ numerically.

Although it is known that finding a permutation in step 2 that
produces an optimally sparse Cholesky factor is a hard problem (in
fact NP-hard), many good heuristic approaches such as minimum degree
and nested dissection give us fill-reducing orderings [7]. This
approach of determining the data structures required for the $R$
factor before the actual factorization (in other words, a static data
structure) has some advantages compared to setting up storage
structures dynamically during the factorization. Accessing the
elements in a static setup is faster and hence the factorization step
is likely to be faster. Since the static structure does not depend on
the numerical values of the original matrix $C$, the cost involved in
steps 1-3 can be spread over a number of factorizations if repeated
computations of $R$ are required with different numerical values of
$C$.

Most of the known algorithms for rank detection (or condition
estimation) are neither cost effective nor appropriate for sparse
matrix applications. Most of the estimators surveyed by Higham [11]
require $O(n^2)$ units of computation time. This is too expensive for
sparse matrices, considering that the factorization of a sparse
matrix itself requires only $O(n^{1.5})$. The QR factorization with
column pivoting upsets the sparsity pattern, because the column
ordering chosen in step 2 is not used. Moreover, the pivoting process
requires us to use a dynamic data structure for $R$. Singular Value
Decomposition is too expensive for practical use, even though it is
the most accurate algorithm for rank detection.


3.  Issues  in  Parallel  Factorization 

In this paper, we limit our discussion of parallel architectures to
distributed memory machines, such as hypercubes. If we want to
implement the factorization in parallel, we need to re-examine the
validity of the traditional methods on such machines. The column
pivoting algorithm requires the processors to synchronize to select
the next pivot column. This not only introduces delays due to
communication overheads but also forces the program into a lock-step
mode, leaving no room for pipelining and/or overlapping of
computations. As was observed already, any pivoting process results
in more fill-in and hence more computation time.

Dynamic data structures are not easy to distribute in a local-memory
environment; even if we manage to do that, keeping track of the
current state of the structure among all processors is not an easy
task. Hence we consider using static data structures. The threshold
strategy described by Heath [10] and implemented in SPARSPAK-B [8]
allows us to work with static data structures for most of the
computations. Even though empirical tests show that this strategy
rarely fails in practice, dramatic failures in rank detection are
possible in some cases. A simple example is the following bidiagonal
matrix, which will be considered a full rank matrix for any value of
$\alpha$, even though $D$ could be arbitrarily ill-conditioned. (In
fact, the actual $\kappa(D) \approx \alpha^n$.)

$$D = \begin{pmatrix}
1 & \alpha & 0 & \cdots & 0 \\
0 & 1 & \alpha & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & 1 & \alpha \\
0 & \cdots & 0 & 0 & 1
\end{pmatrix}$$
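A small numerical check makes the point about this bidiagonal example: every diagonal entry is 1, so a test that inspects only the diagonal sees nothing wrong, yet the condition number grows rapidly with the dimension. The sketch below computes the infinity-norm condition number by explicit back substitution; the choice of the infinity norm and the function names are assumptions made for illustration.

```python
def solve_upper_bidiagonal(alpha, b):
    """Solve D x = b where D has unit diagonal and alpha on the superdiagonal."""
    n = len(b)
    x = [0.0] * n
    x[-1] = b[-1]
    for i in range(n - 2, -1, -1):
        x[i] = b[i] - alpha * x[i + 1]
    return x


def condition_inf(n, alpha):
    """Infinity-norm condition number of the n x n matrix D (n >= 2)."""
    # Column j of the inverse is the solution of D x = e_j.
    cols = [solve_upper_bidiagonal(alpha, [1.0 if i == j else 0.0 for i in range(n)])
            for j in range(n)]
    inv_norm = max(sum(abs(cols[j][i]) for j in range(n)) for i in range(n))
    d_norm = 1.0 + abs(alpha)  # largest absolute row sum of D
    return d_norm * inv_norm


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, condition_inf(n, 10.0))
```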

If we use static data structures during the factorization, we are
allowed to look at each column only once, in a given sequence, and we
should be able to determine whether a new column is linearly
independent of the others already chosen to be in the factor. This
translates to checking whether the resulting upper triangular factor
is going to be well conditioned.

To this end, Bischof [3] describes an incremental estimator for the
smallest singular value, which is a modification of the 2-norm
condition estimator suggested by Cline, Conn and Van Loan [4,13].
However, this algorithm has two serious drawbacks when it comes to
sparse matrices. Firstly, the estimator requires $O(n^2)$ flops
during the triangularization of an $n \times n$ matrix. Secondly, its
estimate of the smallest singular value can differ arbitrarily from
the actual value for matrices with special structure. In particular,
if the new row being added is orthogonal to the current approximate
singular vector, then the estimate is likely to be very poor. As an
example, consider the following $3 \times 3$ matrix.

D = [3 x 3 matrix illegible in the source; only the entries 0, 0, 2 survive]

For the specific value $w = 1 - \sqrt{5}$, the estimate of the
smallest singular value from Bischof's algorithm is 1, independent of
$\beta$, while the actual 2-norm of $D$ is [value illegible in the
source]. We are able to construct such trivial $3 \times 3$ examples
because there is no look-ahead in the estimation algorithm.

Recently Bischof, Lewis and Pierce [2] extended the original
algorithm to handle the case of general matrices and showed how this
modified approach can be used in the nested dissection case.


4.  Incremental  Condition  Estimation 

The proposed algorithm is an "incremental $\infty$-norm" estimator
that uses look-ahead. It is a modification of the LINPACK 1-norm
estimator for upper triangular matrices [5]. The algorithm looks at
each column just once and decides whether to include that column in
the factorization or not. It does that by incrementally estimating
the $\infty$-norm of the inverse of the partially formed upper
triangular factor. There is no column pivoting and hence static data
structures can be used.

We are interested in computing a QR factorization of the matrix $C$,
with accurate rank detection, so that the factored matrix has the
following form

$$C = Q \begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix}.$$

Define the sequence of upper triangular matrices $U^{(k)}$,
$k = 1, 2, \ldots, l$, by

$$U^{(1)} = (u_{11}) \quad \text{where } u_{11} = \|c_1\|_2$$

and

$$U^{(k+1)} = \begin{pmatrix} U^{(k)} & v_{k+1} \\ 0 & \gamma_{k+1} \end{pmatrix}$$

where

$$v_{k+1} = \left(c^{(k)}_{1,k+1}, \ldots, c^{(k)}_{k,k+1}\right)^T,$$

$$c^{(k)}_{k+1} = \left(c^{(k)}_{1,k+1}, \ldots, c^{(k)}_{m,k+1}\right)^T$$

is the $(k+1)$st column of $C$ after $H_1, H_2, \ldots, H_k$ are
applied and

$$\gamma_{k+1} = \sqrt{\|c_{k+1}\|_2^2 - \|v_{k+1}\|_2^2}.$$

Let

$$L^{(k)} = [U^{(k)}]^T$$

and

$$d^{(1)} = 1; \qquad \sigma_1 = 1/u_{11} = 1/\|c_1\|_2 = 1/\gamma_1.$$

To choose the $(k+1)$st column, we let $x^{(k)}$ be such that

$$L^{(k)} x^{(k)} = d^{(k)}$$

where $d^{(k)}$ is a vector of $\pm 1$, chosen to maximize
$\|x^{(k+1)}\|_\infty$. Then compute

$$x^{(k+1)} = (x^{(k)}, \xi_{k+1})^T$$

where

$$\xi_{k+1} = \gamma_{k+1}^{-1}\left(-\mathrm{sign}(v_{k+1}^T x^{(k)}) - v_{k+1}^T x^{(k)}\right).$$

Thus

$$\sigma_{k+1} = \max\{\sigma_k, |\xi_{k+1}|\} = \|x^{(k+1)}\|_\infty.$$

This procedure is precisely the LINPACK estimator without the
"look-ahead" property.

To incorporate a "look-ahead", we consider the partial sums

$$p_j^+ = v_j^T \begin{pmatrix} x^{(k)} \\ \xi^+_{k+1} \end{pmatrix}
\quad \text{and} \quad
p_j^- = v_j^T \begin{pmatrix} x^{(k)} \\ \xi^-_{k+1} \end{pmatrix}$$

where

$$\xi^+_{k+1} = \gamma_{k+1}^{-1}(1 - p_{k+1})$$

and

$$\xi^-_{k+1} = \gamma_{k+1}^{-1}(-1 - p_{k+1})$$

and $v_j$ consists of the leading entries of $c_j^{(k)}$, which is
the $j$th column of $C$ after $H_1, H_2, \ldots, H_k$ are applied.
Note that the last entry of $v_j$ is not known until after we use
column $k$ to form $H_k$. The $p_j$ can be accumulated throughout the
computation. For weights $t_1, t_2, \ldots, t_n > 0$, we then examine

$$\zeta^+ = |\xi^+_{k+1}| + \sum_{j=k+1}^{n} t_j |p_j^+|$$

and

$$\zeta^- = |\xi^-_{k+1}| + \sum_{j=k+1}^{n} t_j |p_j^-|$$

and choose

$$x^{(k+1)} = (x^{(k)}, \xi^+_{k+1})^T$$

or

$$x^{(k+1)} = (x^{(k)}, \xi^-_{k+1})^T$$

according to whether $\zeta^+$ or $\zeta^-$ is larger. The choice of
weights is heuristic. LINPACK chooses $t_j = u_{jj}^{-1}$. However,
we have not computed $u_{jj}$ at this point. So we choose
$t_j = \gamma_j^{-1}$.
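The estimator without look-ahead is short enough to sketch in full. The code below is an illustration on a small dense upper triangular matrix, not the sparse implementation: it grows the solution of $U^T x = d$ one entry at a time, choosing each $\pm 1$ entry of $d$ to make the new entry of $x$ as large as possible, and returns the running maximum $\sigma$, which is a lower bound on $\|U^{-T}\|_\infty$. The function name and the dense list-of-lists storage are assumptions.

```python
def incremental_inverse_norm_estimate(U):
    """Greedy lower bound on the infinity norm of the inverse of U^T."""
    n = len(U)
    x = []
    sigma = 0.0
    for k in range(n):
        # Partial sum p = v^T x, with v the part of column k above the diagonal.
        p = sum(U[i][k] * x[i] for i in range(k))
        s = 1.0 if p >= 0.0 else -1.0
        # The right-hand side entry -s makes |xi| = (1 + |p|) / |U[k][k]|.
        xi = (-s - p) / U[k][k]
        x.append(xi)
        sigma = max(sigma, abs(xi))
    return sigma


if __name__ == "__main__":
    U = [[1.0, 2.0, 0.0],
         [0.0, 1.0, 2.0],
         [0.0, 0.0, 1.0]]
    print(incremental_inverse_norm_estimate(U))
```

For this small bidiagonal example the estimate coincides with the exact norm of the inverse; in general it is only a lower bound, which is what motivates the look-ahead weighting.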

We now explain how this can be used in column selection. Our
algorithm performs the condition estimator on the most recently
formed diagonal block $C_{ii}$ until

1. it finds a zero diagonal

2. the estimate of $\|C_{ii}^{-1}\|$ exceeds $\epsilon^{-1}$, where
   $\epsilon$ is a predefined tolerance and is usually $O(\mu)$,
   $\mu$ being the machine precision.

In both cases, we restart the condition estimator with all $p_j = 0$
(implicitly) and then begin forming $C_{i+1,i+1}$. Of course, in case
2, we must find a dependent column in $C_{ii}$. To do this, we solve

$$C_{ii} h = e_k.$$

Let $\nu$ be the index such that

$$|h_\nu| = \max_{1 \le j \le k} |h_j|.$$

We then delete the column $\nu$ from $C_{ii}$ and re-triangularize
the new matrix by a sequence of Givens rotations. The general idea of
the algorithm is detailed below.

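/* $\epsilon \approx \mu$ is a tolerance factor.
   $c^{(l)}$ denotes the $l$th row of $C$. */
$l \leftarrow 0$
done $\leftarrow$ false
$k \leftarrow 1$
firstk $\leftarrow 1$
firstl $\leftarrow 1$
$q_i \leftarrow i$, $i = 1, 2, \ldots, n$
$\gamma_i \leftarrow \|c_i\|_2$, $i = 1, 2, \ldots, n$
$\sigma \leftarrow 0$
while not done do
    $\xi \leftarrow \gamma_k^{-1}(-\mathrm{sign}(p_k) - p_k)$
    $\sigma \leftarrow \max\{\sigma, |\xi|\}$
    if $\sigma^{-1} > \epsilon$ then
        construct an orthogonal transformation
        such that the current column is zeroed
        /* this column is good */
        $l \leftarrow l + 1$ and $q_l \leftarrow k$
        for $j \in Nonz(c^{(l)})$
            update the column norms $\gamma_j$
        $\xi^+ \leftarrow \gamma_k^{-1}(1 - p_k)$
        $\xi^- \leftarrow \gamma_k^{-1}(-1 - p_k)$
        $\zeta^+_{max} \leftarrow |\xi^+|$
        $\zeta^-_{max} \leftarrow |\xi^-|$
        for $j \in Nonz(c^{(l)})$ update the partial sums
            $p_j^+ \leftarrow p_j + c_{lj}\, \xi^+$
            $p_j^- \leftarrow p_j + c_{lj}\, \xi^-$
            $\zeta^+_{max} \leftarrow \max\{\zeta^+_{max}, |p_j^+ / \gamma_j|\}$
            $\zeta^-_{max} \leftarrow \max\{\zeta^-_{max}, |p_j^- / \gamma_j|\}$
        /* We just looked ahead at the affected columns
           and computed both possible values for
           the partial products */
        if $\zeta^+_{max} > \zeta^-_{max}$ then
            $\sigma \leftarrow \max\{\sigma, |\xi^+|\}$
            $p_j \leftarrow p_j^+$, $j \in Nonz(c^{(l)})$
        else
            $\sigma \leftarrow \max\{\sigma, |\xi^-|\}$
            $p_j \leftarrow p_j^-$, $j \in Nonz(c^{(l)})$
    else the column is not good
        if $\gamma_{max} < \epsilon$ then
            /* no more good columns and we are done */
            exit
        else
            Let $C_{ii}$ be the submatrix of $C$
            with rows from firstl through $l$
            and columns from firstk through $k$.
            Solve $C_{ii} h = (0, 0, \ldots, 1)^T$.
            Let $|h_\nu| = \max_j(|h_j|)$
            /* Break ties arbitrarily in choosing
               the maximum element */
            Move the column $\nu$ to the last column of $C_{ii}$
            and re-triangularize $C_{ii}$.
            firstl $\leftarrow l$
            firstk $\leftarrow k$
            $p_j \leftarrow 0$  /* starting over a new block */
        endif
    endif
    $k \leftarrow k + 1$
    done $\leftarrow$ ($l > m$ or $k > n$)
endwhile

Figure 1: Factorization Algorithm with the Incremental Condition
Estimator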
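The back-solve that selects the column to move when a block is judged ill-conditioned can be sketched on a small dense upper triangular block. This is an illustration under stated assumptions: the block is stored as a list of rows, its diagonal is nonzero (the zero-diagonal case is handled separately in the algorithm), ties are broken by taking the first maximum, and the function name is hypothetical.

```python
def most_dependent_column(C):
    """Solve C h = (0, ..., 0, 1)^T for upper triangular C and return
    (nu, h), where nu is the 0-based index of the largest |h_j|."""
    n = len(C)
    h = [0.0] * n
    for i in range(n - 1, -1, -1):
        rhs = 1.0 if i == n - 1 else 0.0
        s = rhs - sum(C[i][j] * h[j] for j in range(i + 1, n))
        h[i] = s / C[i][i]
    nu = max(range(n), key=lambda j: abs(h[j]))
    return nu, h


if __name__ == "__main__":
    block = [[1.0, 2.0, 0.0],
             [0.0, 1.0, 2.0],
             [0.0, 0.0, 1.0]]
    print(most_dependent_column(block))
```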

As soon as a bad column is encountered, the matrix being factored
will have the appearance shown below.


Since a back-solve and a re-triangularization are done every time a
bad column is encountered, we need a column which is a full vector.
Fortunately, we can just reserve one vector (whose size is equal to
the number of rows) to store the intermediate computations. The
typical structure of the matrix after the factorization is shown
below.


The final upper trapezoidal form of $C$ may have the following form.

$$\begin{pmatrix}
C_{11} & C_{12} & C_{13} & C_{14} \\
0 & C_{22} & C_{23} & C_{24} \\
0 & 0 & C_{33} & C_{34}
\end{pmatrix}$$

where $C_{ii}$, $i = 1, 2, 3$ have full row rank.

5. Issues in Parallel Implementation

The problem is to detect the rank of a large and sparse matrix $C$
and obtain a QR factorization of that matrix. We follow the general
strategy described in section 2. We use SPARSPAK-B [8] for steps 1-3
of that strategy. From the storage structure provided for $R$ by
SPARSPAK, we generate the static structure required for the
factorization. Since this work involves only the symbolic structure
of $C$ and is a well understood problem, we perform only the
numerical factorization part on the parallel machine.

We consider issues in implementing this algorithm on a hypercube
architecture. In a rather straightforward way, the columns of the
matrix $C$ are wrapped around among the processors on the hypercube.
For the sake of simplicity, we may assume that the processors form a
ring, although for "broadcast" purposes, other connections of the
hypercube are implicitly made use of. Each processor decides whether
to include the next column in the factorization and sends a message
to the other processors along with the necessary transformations (if
the column is included). The updating of the rest of the columns on
the same processor is done only after sending the information to
other processors.

There are a couple of obvious bottlenecks to this algorithm when
implemented on a parallel machine. The "back-solve" process, when a
bad column is found, involves accessing the partially formed upper
triangular factor. This also causes the algorithm to come to a
virtual pause, losing some of the advantages of the asynchronous
behavior of the algorithm. However, this happens only occasionally,
so we can still expect some good speed-ups.

The "look-ahead" part of the algorithm, where it needs to find out
which value of $p$ is to be made permanent, is another bottleneck.
There are a couple of ways to get around this problem. We can make
the look-ahead local to the columns held by that processor only. But
then we may be compromising the quality of the estimate computed by
the algorithm. The other alternative is to maintain a fixed number of
possibilities (say $z = 4$) for the values of $p$, so that in the
steady state, by the time each processor is ready to process a column
$y$, it has enough information to fix the value of $p$ corresponding
to column $(y - z)$.

This algorithm was implemented on an Intel iPSC/2 Hypercube and
Table 1 summarizes the speed-ups that have been obtained. The test
matrix, with 4000 columns and approximately 30000 non-zero elements
in the factored matrix, was generated randomly with a random sparsity
structure. The speed-up obtained is a factor of about 1.4 for each
doubling of the number of processors.

Table 1: Timing results on a random sparse matrix

No. of processors   Time (secs)
2                   15.20
4                   10.78
8                   7.76
16                  5.54

Table 2: Results of our condition estimation tests

$\kappa_2$   n = 10       n = 25       n = 50
10           0.36/0.67    0.33/0.53    0.30/0.43
10^3         0.20/0.58    0.20/0.42    0.22/0.37
10^6         0.11/0.48    0.12/0.36    0.10/0.27
10^9         0.12/0.51    0.12/0.33    0.09/0.26
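The quoted factor of 1.4 per doubling can be checked against the timings in Table 1 with a few lines of arithmetic. The sketch below uses the four measured times as given; the names are illustrative.

```python
# Times in seconds from Table 1, keyed by number of processors.
TIMINGS = {2: 15.20, 4: 10.78, 8: 7.76, 16: 5.54}


def doubling_speedups(timings):
    """Ratio of run times for each successive doubling of the processor count."""
    procs = sorted(timings)
    return [timings[a] / timings[b] for a, b in zip(procs, procs[1:])]


if __name__ == "__main__":
    for ratio in doubling_speedups(TIMINGS):
        print(f"{ratio:.2f}")
```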

6.  Robustness  of  the  algorithm 

To test the effectiveness of this estimator, we used this algorithm
to estimate the condition number of a given matrix. As suggested by
Stewart [12], we generated random test matrices of dimension 10, 25
and 50 with a known condition number, the values being 1.0E1, 1.0E3,
1.0E6 and 1.0E9. For each of the possibilities, we generated two
types of matrices: one where there is a sharp break in the singular
value distribution, and the other in which the singular values are
exponentially distributed between 1 and the condition number.

The algorithm always estimated correctly (within 2 decimal digit
accuracy) if there is a sharp break in the singular value
distribution, and hence the results in Table 2 only illustrate the
case where there is an exponential distribution of singular values.
For each dimension $n$, 50 test matrices were generated. $\kappa_2$
is the actual condition number of the test matrix. The numbers quoted
in each entry represent the minimum / average value of the ratio of
the estimated condition number to the actual value. The results are
rounded to two significant digits, so a ratio of 1.0 implies that the
estimate had at least 2 correct digits.

Comparative results are included in Table 3 for LINPACK (taken from
Higham [11]) and in Table 4 for Bischof's estimator (taken from
Bischof [1]).




Table 3: Results of LINPACK condition estimation tests

$\kappa_2$   n = 10       n = 25       n = 50
10           0.29/0.46    0.24/0.30    0.17/0.23
10^3         0.29/0.56    0.20/0.33    0.19/0.26
10^6         0.46/0.76    0.20/0.46    0.22/0.35
10^9         0.68/0.86    0.24/0.55    0.23/0.40

Table 4: Results of Bischof's condition estimation tests

$\kappa_2$   n = 10       n = 25       n = 50
10           0.56/0.77    0.59/0.71    0.63/0.71
10^3         0.33/0.53    0.40/0.50    0.31/0.45
10^6         0.12/0.53    0.16/0.38    0.24/0.36
10^9         0.16/0.45    0.17/0.33    0.19/0.31

7.  Conclusions 

We proposed an incremental condition estimation algorithm that is
well suited for parallel sparse matrix factorizations. The empirical
results suggest that the estimator is robust. Good speed-ups have
been demonstrated on randomly generated test problems.

Abstract

Efficient sparse linear algebra cannot be achieved as a
straightforward extension of the dense case, even for concurrent
implementations. This paper details a new, general-purpose
unsymmetric sparse LU factorization code built on the philosophy of
Harwell's MA28, with variations. We apply this code in the framework
of Jacobian-matrix factorizations, arising from Newton iterations in
the solution of nonlinear systems of equations. Serious attention has
been paid to the data-structure requirements, complexity issues and
communication features of the algorithm. Key results include reduced
communication pivoting for both the "analyze" A-mode and repeated
B-mode factorizations, and effective general-purpose data
distributions useful incrementally to trade off process-column load
balance in factorization against triangular solve performance. Future
planned efforts are cited in conclusion.

Introduction 

The topic of this paper is the implementation and concurrent
performance of sparse, unsymmetric LU factorization for medium-grain
multicomputers. Our target hardware is distributed-memory,
message-passing concurrent computers such as the Symult s2010 and
Intel iPSC/2 systems. For both of these systems, efficient
cut-through wormhole routing technology provides pair-wise
communication performance essentially independent of the spatial
location of the computers in the ensemble [2]. The Symult s2010 is a
two-dimensional, mesh-connected concurrent computer; all examples in
this paper were run on this variety of hardware. Message-passing
performance, portability and related issues relevant to this work are
detailed in [7].

Questions of linear-algebra performance are pervasive throughout
scientific and engineering computation. The need for high-quality,
high-performance linear algebra algorithms (and libraries) for
multicomputer systems therefore requires no attempt at justification.
The motivation for the work described here has a specific origin,
however. Our main higher-level research goal is the concurrent
dynamic simulation of systems modelled by ordinary differential and
algebraic equations; specifically, dynamic flowsheet simulation of
chemical plants (e.g., coupled distillation columns) [8]. Efficient
sequential integration algorithms solve staticized nonlinear
equations at each time point via modified Newton iteration (cf. [3],
Chapter 5). Consequently, a sequence of structurally identical linear
systems must be solved; the matrices are finite-difference
approximations to Jacobians of the staticized system of ordinary
differential-algebraic equations. These Jacobians are large, sparse
and unsymmetric for our application area. In general, they possess
both band and significant off-band structure. Generic structures are
depicted in Figure 0. This work should also bear relevance to
electric power network/grid dynamic simulation, where sparse,
unsymmetric Jacobians also arise, and elsewhere.

Design  Overview 

We solve the problem Ax = b where A is large and includes many zero
entries. We assume that A is unsymmetric both in sparsity pattern and
in numerical values. In general, the matrix A will be computed in a
distributed fashion, so we will inherit a distribution of the
coefficients of A (cf. Figures 2 and 3). Following the style of
Harwell's MA28 code for unsymmetric sparse matrices, we use a
two-phase approach to this solution. There is a first LU
factorization called A-mode or “analyze,” which builds data
structures dynamically, and uses a user-defined pivoting function.
The repeated B-mode factorization uses the existing data structures
statically to factor a new, similarly structured matrix, with the
previous pivoting pattern. B-mode monitors stability with a simple
growth factor estimate. In practice, A-mode is repeated whenever
instability is detected. The two key contributions of this sparse
concurrent solver are reduced-communication pivoting and new data
distributions for better overall performance.

Figure 0. Example Jacobian Matrix Structures.

In chemical-engineering process flowsheets, Jacobians with main band
structure, and lower-triangular structure (feedforwards),
upper-triangular structure (feedbacks), and borders (global or
artificially restructured feedforwards and/or feedbacks) are common.

Following Van de Velde [11], we consider the LU factorization of a
real matrix $A$, $A \in \mathbb{R}^{N \times N}$. It is well known
(e.g., [6], pp. 117-118) that for any such matrix $A$, an LU
factorization of the form

\[ P_R A P_C^T = \hat{L}\hat{U} \]

exists, where $P_R$, $P_C$ are square, (orthogonal) permutation
matrices, and $\hat{L}$, $\hat{U}$ are the unit lower-triangular and
upper-triangular factors, respectively. Whereas the pivot sequence is
stored (two N-length integer vectors), the permutation matrices are
not stored or computed with explicitly. Rearranging, based on the
orthogonality of the permutation matrices,
$A = P_R^T \hat{L}\hat{U} P_C$. We factor $A$ with implicit pivoting
(no rows or columns are exchanged explicitly as a result of
pivoting). Therefore, we do not store $\hat{L}$, $\hat{U}$ directly,
but instead: $L = P_R^T \hat{L} P_C$, $U = P_R^T \hat{U} P_C$.
Consequently, $\hat{L} = P_R L P_C^T$, $\hat{U} = P_R U P_C^T$, and
$A = L (P_C^T P_R) U$. The “unravelling” of the permutation matrices
is accomplished readily (without implication of additional
interprocess communication) during the triangular solves.

Figure 1. Linked-list Entry Structure of Sparse Matrix.

A single entry consists of a double-precision value (8 bytes), the
local row (i) and column (j) index (2 bytes each), a “Next Column
Pointer” indicating the next current column entry (fixed j), and a
“Next Row Pointer” indicating the next current row entry (fixed i),
at 4 bytes each. Total: 24 bytes per entry.
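To make the idea of implicit pivoting concrete, the following is a minimal dense, sequential sketch in Python; it is not the sparse concurrent code described in this paper. It assumes partial row pivoting with columns taken in order, and the names `lu_implicit` and `solve_implicit` are hypothetical. Rows and columns are never exchanged: only the pivot sequence is kept (two N-length integer lists), and the permutations are unravelled during the triangular solves.

```python
def lu_implicit(a):
    """Factor a dense square matrix in place with implicit pivoting.

    Returns the pivot sequence (prow, pcol): at step k the pivot is the
    entry a[prow[k]][pcol[k]]. No rows or columns are ever swapped.
    """
    n = len(a)
    prow, pcol = [], []
    free_rows, free_cols = set(range(n)), set(range(n))
    for k in range(n):
        j = k  # assumption: partial row pivoting, columns in order
        i = max(free_rows, key=lambda r: abs(a[r][j]))
        if a[i][j] == 0.0:
            raise ZeroDivisionError("singular matrix")
        free_rows.remove(i)
        free_cols.remove(j)
        prow.append(i)
        pcol.append(j)
        for r in free_rows:
            m = a[r][j] / a[i][j]
            a[r][j] = m  # multiplier (entry of L) stored in place
            for c in free_cols:
                a[r][c] -= m * a[i][c]
    return prow, pcol


def solve_implicit(a, prow, pcol, b):
    """Solve A x = b from the in-place factors and the pivot sequence."""
    n = len(a)
    y = list(b)
    for k in range(n):  # forward elimination (unit-diagonal L)
        for m in range(k + 1, n):
            y[prow[m]] -= a[prow[m]][pcol[k]] * y[prow[k]]
    x = [0.0] * n
    for k in reversed(range(n)):  # back substitution with U
        s = y[prow[k]]
        for m in range(k + 1, n):
            s -= a[prow[k]][pcol[m]] * x[pcol[m]]
        x[pcol[k]] = s / a[prow[k]][pcol[k]]
    return x
```

In this sketch the stored array plays the role of $L$ and $U$ in place, and the index lists `prow` and `pcol` stand in for $P_R$ and $P_C$.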

For the sparse case, performance is more difficult to quantify than
for the dense case, but, for example, banded matrices with bandwidth
$\beta$ can be factored with $O(\beta^2 N)$ work; we expect sub-cubic
complexity in N for reasonably sparse matrices, and strive for
sub-quadratic complexity for very sparse matrices. The triangular
solves can be accomplished in work proportional to the number of
entries in the respective triangular matrix L or U. The pivoting
strategy is treated as a parameter of the algorithm and is not
pre-determined. We can consequently treat the pivoting function as an
application-dependent function, and sometimes tailor it to special
problem structures (cf. Section 7 of [9]) for higher performance. As
for all sparse solvers, we also seek sub-quadratic memory
requirements in N, attained by storing matrix entries in linked-list
fashion, as illustrated in Figure 1.

For further discussion of LU factorizations and sparse matrices, see
[6,4].

Reduced-Communication  Pivoting 

At each stage of the concurrent LU factorization, the pivot element
is chosen by the user-defined pivot function. Then, the pivot row
(new row of U) must be broadcast, and the pivot column (new column of
L) must be computed and broadcast on the logical process grid (cf.
Figure 2), vertically and horizontally, respectively. Note that these
are interchangeable operations. We use this degree of freedom to
reduce the communication complexity of particular pivoting
strategies, while impacting the effort of the LU factorization itself
negligibly.

We define two “correctness modes” of pivoting functions. In the first
correctness mode, “first row fanout,” the exit conditions for the
pivot function are: all processes must know p (the pivot process
row); the pivot process row must know q (the pivot process column) as
well as i, the p-local matrix row of the pivot; and the pivot process
must know in addition the pivot value and the q-local matrix column j
of the pivot. Partial column pivoting and preset pivoting can be set
up to satisfy these correctness conditions as follows. For partial
column pivoting, the kth row is eliminated at the kth step of the
factorization. From this fact, each process can derive the process
row p and p-local matrix row i using the row data-distribution
function. Having identified themselves, the pivot-row processes can
look for the largest element in local matrix row i and choose the
pivot element globally among themselves via a combine. At completion
this places q, j and the pivot value in the entire pivot process row.
This completes the requirements for the “first row fanout”
correctness mode. For preset pivoting, the kth elimination row and
column are both stored as p, i, q, j, and each process knows these
values without communication.¹ Furthermore, the pivot process looks
up the pivot value. Hence, preset pivoting satisfies the requirements
of this correctness mode also.

¹ Memory unscalabilities can be removed very cheaply; see [8].

For “first row fanout,” the universal knowledge of p and knowledge of
the pivot matrix row i by the pivot process row allow the vertical
broadcast of this row (new row of U). In addition, we broadcast q, j
and the pivot value simultaneously. This extends the correct value of
q to all processes, as well as j and the pivot value to the pivot
process column. Hence, the multiplier (L) column may be correctly
computed and broadcast. Along with the multiplier column broadcast,
we include the pivot value. After this broadcast, all processes have
the correct indices p, i, q, j and the pivot value. This provides all
that is required to complete the current elimination step.

For the second correctness mode, “first column fanout,” the exit
conditions for the pivot function are: all processes must know q; the
entire pivot process column must know j, the pivot value, and p. The
pivot process in addition knows i. Partial row pivoting can be set up
to satisfy these correctness conditions. The arguments are analogous
to partial column pivoting and are given in [8].

For “first column fanout,” the entire pivot process column knows the
pivot value, and local column of the pivot. Hence, the multiplier
column may be computed by dividing the pivot matrix column by the
pivot value. This column of L may then be broadcast horizontally,
including the pivot value, p and i as additional information. After
this step, the entire ensemble has the correct pivot value, and p; in
addition, the pivot process row has the correct i. Hence, the pivot
matrix row may be identified and broadcast. This second broadcast
completes the needed information in each process for effecting the
kth elimination step.

Hence, when using partial row or partial column pivoting, only local
combines of the pivot process column (respectively row) are needed.
The other processes do not participate in the combine, as they would
have to without this methodology. Preset pivoting implies no pivoting
communication, except very occasionally (e.g., 1 in 5000 times) as
noted in [8] to remove memory unscalabilities. This pivoting approach
is a direct savings, gained at a negligible additional broadcast
overhead. See also [8].
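As an illustration of the “first row fanout” exit conditions under partial column pivoting, the following sequential Python sketch imitates what the processes would compute. It assumes scatter distributions for both rows and columns and operates on a global matrix for simplicity; the function names and the returned dictionary are hypothetical and are not part of the solver described here.

```python
def scatter(I, P):
    """Scatter distribution: global name I -> (process, local name)."""
    return I % P, I // P


def choose_pivot_partial_column(a, k, P, Q, eliminated_cols):
    """Pick the pivot for elimination step k on a P x Q process grid.

    `a` is the global matrix (list of rows). Only the Q processes of the
    pivot process row take part in the combine.
    """
    n = len(a)
    p, i = scatter(k, P)  # derived by every process, no communication
    # Each process (p, q) scans its own part of global row k ...
    local_best = []
    for q in range(Q):
        mine = [(abs(a[k][J]), J) for J in range(q, n, Q)
                if J not in eliminated_cols]
        if mine:
            local_best.append(max(mine))
    # ... and a combine over the pivot process row selects the winner.
    _, J = max(local_best)
    q, j = scatter(J, Q)
    return {"p": p, "i": i, "q": q, "j": j, "value": a[k][J],
            "combine_participants": Q, "grid_size": P * Q}
```

The point of the sketch is the participant count: the combine involves Q processes rather than all PQ, while p and i follow from k alone.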

New  Data  Distributions 

We introduce new closed-form, O(1)-time, O(1)-memory data
distributions useful for sparse matrix factorizations and the
problems that generate such matrices. We quantify evaluation costs in
Table 0. Every concurrent data structure is associated with a logical
process grid at creation (cf. Figure 2 and [7,8]). Vectors are either
row- or column-distributed within a two-dimensional process grid.
Row-distributed vectors are replicated in each process column, and
distributed in the process rows. Conversely, column-distributed
vectors are replicated in each process row, and distributed in the
process columns. Matrices are distributed both in rows and columns,
so that a single process owns a subset of matrix rows and columns.
This partitioning follows the ideas proposed by Fox et al. [5] and
others. Within the process grid, coefficients of vectors and matrices
are distributed according to one of several data distributions. Data
distributions are chosen to compromise between load-balancing
requirements and constraints on where information can be calculated
in the ensemble.

Table 0. Data-Distribution Function and Inverse Costs

  Distribution:         $\mu(I,P,M)$    $\mu^{-1}(p,i,P,M)$
  One-Parameter (ζ)
  Two-Parameter (ξ)
  Block-Linear (λ)

For the data distributions and inverses described here, evaluation
time in μs is quoted for the Symult s2010 multicomputer. Cardinality
function calls are inexpensive, and fall within lower-order work
anyway; their timing is hence omitted. The cheapest distribution
function (scatter) costs ≈ 15 μs by way of comparison.

Figure 2. Process Grid Data Distribution of Ax = b.

Representation of a concurrent matrix, and distributed-replicated
concurrent vectors on a 4 × 4 logical process grid. The solution of
Ax = b first appears in x, a column-distributed vector, and then is
normally “transposed” via a global combine to the row-distributed
vector y.

Definition 1 (Data-Distribution Function)

A data-distribution function $\mu$ maps three integers to a pair,
$\mu(I, P, M) \mapsto (p, i)$, where $I$, $0 \le I < M$, is the
global name of a coefficient, $P$ is the number of processes among
which all coefficients are to be partitioned, and $M$ is the total
number of coefficients. The pair $(p, i)$ represents the process $p$
($0 \le p < P$) and local (process-$p$) name $i$ of the coefficient
($0 \le i < \mu^{\sharp}(p, P, M)$). The inverse distribution
function $\mu^{-1}(p, i, P, M) \mapsto I$ transforms the local name
$i$ back to the global coefficient name $I$.

The formal requirements for a data-distribution function are as
follows. Let $\mathcal{I}^p$ be the set of global coefficient names
associated with process $p$, $0 \le p < P$, defined implicitly by a
data-distribution function $\mu(\bullet, P, M)$. The following set
properties must hold:

\[
\mathcal{I}^{p_1} \cap \mathcal{I}^{p_2} = \emptyset, \quad
\forall\, p_1 \ne p_2, \; 0 \le p_1, p_2 < P,
\]
\[
\bigcup_{p=0}^{P-1} \mathcal{I}^p = \{0, \ldots, M-1\} \equiv \mathcal{I}_M .
\]

The cardinality of the set $\mathcal{I}^p$ is given by
$\mu^{\sharp}(p, P, M)$.

The linear and scatter data-distribution functions are most often
defined. We generalize these functions (by blocking and scattering
parameters) to incorporate practically important degrees of freedom.
These generalized distribution functions yield optimal static load
balance as do the unmodified functions described in [11] for unit
block size, but differ in coefficient placement. This distinction is
technical, but necessary for efficient implementations.
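The requirements of Definition 1 can be checked mechanically. The sketch below, in Python, implements the two familiar distributions in one common convention (for the linear case, the first M mod P processes receive one extra coefficient), which is an assumption and not necessarily the convention of [11]. The function names and the helper `check_partition` are hypothetical.

```python
def scatter(I, P, M):
    return I % P, I // P


def scatter_inv(p, i, P, M):
    return i * P + p


def scatter_card(p, P, M):
    # Number of I in [0, M) with I mod P == p.
    return (M - p + P - 1) // P


def linear(I, P, M):
    l, r = divmod(M, P)
    if I < r * (l + 1):
        return divmod(I, l + 1)
    p, i = divmod(I - r * (l + 1), l)
    return r + p, i


def linear_inv(p, i, P, M):
    l, r = divmod(M, P)
    return p * l + min(p, r) + i


def linear_card(p, P, M):
    l, r = divmod(M, P)
    return l + (1 if p < r else 0)


def check_partition(mu, mu_inv, mu_card, P, M):
    """Verify the set properties; return the index sets I^p as sorted lists."""
    owned = [set() for _ in range(P)]
    for I in range(M):
        p, i = mu(I, P, M)
        assert 0 <= p < P and 0 <= i < mu_card(p, P, M)
        assert mu_inv(p, i, P, M) == I  # inverse recovers the global name
        owned[p].add(I)
    assert all(len(owned[p]) == mu_card(p, P, M) for p in range(P))
    assert sum(len(s) for s in owned) == M  # pairwise disjoint
    assert set().union(*owned) == set(range(M))  # union covers 0..M-1
    return [sorted(s) for s in owned]
```

For example, `check_partition(scatter, scatter_inv, scatter_card, 4, 11)` returns the four index sets of a scatter distribution of 11 coefficients over 4 processes.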
Definition 2 (Generalized Block-Linear)

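The definitions for the generalized block-linear distribution
function, inverse, and cardinality function are:

\[
\lambda_B(I, P, M) \mapsto (p, i),
\]
\[
p \equiv P - 1 - \max\left(
  \left\lfloor \frac{I_B^{rev}}{l + 1} \right\rfloor,
  \left\lfloor \frac{I_B^{rev} - r}{l} \right\rfloor \right),
\]
\[
i \equiv I - B\left( p\,l + \Theta^1(p - (P - r)) \right),
\]

while

\[
\lambda_B^{-1}(p, i, P, M) \equiv i + B\left( p\,l + \Theta^1(p - (P - r)) \right),
\]
\[
\lambda_B^{\sharp}(p, P, M) \equiv \cdots (M \bmod B) \cdots ,
\]

where $B$ denotes the coefficient block size,

\[
b \equiv \begin{cases}
  M/B & \text{if } M \bmod B = 0 \\
  \lfloor M/B \rfloor + 1 & \text{otherwise,}
\end{cases}
\]
\[
l \equiv \left\lfloor \frac{b}{P} \right\rfloor, \quad
r \equiv b \bmod P, \quad
I_B \equiv \left\lfloor \frac{I}{B} \right\rfloor, \quad
I_B^{rev} \equiv b - 1 - I_B,
\]
\[
\Theta^k(t) \equiv \begin{cases}
  0 & t < 0 \\
  t & t \ge 0,\; k > 0 \\
  1 & t \ge 0,\; k = 0,
\end{cases}
\]

and where $b \ge P$.

For $B = 1$, a load-balance-equivalent variant of the common linear
data-distribution function is recovered. The general block-linear
distribution function divides coefficients among the $P$ processes
$p = 0, \ldots, P - 1$ so that each $\mathcal{I}^p$ is a set of
coefficients with contiguous global names, while optimally
load-balancing the $b$ blocks among the $P$ sets. Coefficient
boundaries between processes are on multiples of $B$. The maximum
possible coefficient imbalance between processes is $B$. If
$M \bmod B \ne 0$, the last block in process $P - 1$ will be
foreshortened.

Definition 3 (Parametric Functions)

To allow greater freedom in the distribution of coefficients among
processes, we define a new, two-parameter distribution function
family, $\xi$. The $B$ blocking parameter (just introduced in the
block-linear function) is mainly suited to the clustering of
coefficients that must not be separated by an interprocess boundary
(again, see [8] for a definition of general block-scatter, $\sigma$).
Increasing $B$ worsens the static load balance. Adding a second
scaling parameter $S$ (of no impact on the static load balance)
allows the distribution to scatter coefficients to a greater or
lesser degree, directly as a function of this one parameter. The
two-parameter distribution function, inverse and cardinality function
are defined below. The one-parameter distribution function family,
$\zeta$, occurs as the special case $B = 1$, also as noted below:

\[
\xi_{B,S}(I, P, M) \mapsto (p, i) \equiv \begin{cases}
  (p_0, i_0) & \Lambda_0 \ge l_S \\
  (p_1, i_1) & \Lambda_0 < l_S ,
\end{cases}
\]

where

\[
l_S \equiv \left\lfloor \frac{l}{S} \right\rfloor, \quad
\Lambda_0 \equiv \left\lfloor \frac{i_0}{BS} \right\rfloor, \quad
(p_0, i_0) \equiv \lambda_B(I, P, M),
\]
\[
I_{BS} \equiv p_0\, l_S + \Lambda_0, \quad
p_1 \equiv I_{BS} \bmod P, \quad
i_1 \equiv BS \left\lfloor \frac{I_{BS}}{P} \right\rfloor + (i_0 \bmod BS),
\]

with

\[
\zeta_S(I, P, M) \equiv \xi_{1,S}(I, P, M),
\]
\[
\zeta_S^{\sharp}(p, P, M) \equiv \xi_{1,S}^{\sharp}(p, P, M), \qquad
\xi_{B,S}^{\sharp}(p, P, M) \equiv \lambda_B^{\sharp}(p, P, M),
\]

and where $r$, $b$, etc. are as defined above. The inverse
distribution function is defined as follows:

\[
\xi_{B,S}^{-1}(p, i, P, M) \mapsto I \equiv \lambda_B^{-1}(p^*, i^*, P, M),
\qquad
(p^*, i^*) \equiv \begin{cases}
  (p, i) & \Lambda \ge l_S \\
  (p_2, i_2) & \Lambda < l_S ,
\end{cases}
\]
\[
\Lambda \equiv \left\lfloor \frac{i}{BS} \right\rfloor, \quad
I_{BS} \equiv p + \Lambda P, \quad
p_2 \equiv \left\lfloor \frac{I_{BS}}{l_S} \right\rfloor, \quad
i_2 \equiv BS\,(I_{BS} \bmod l_S) + (i \bmod BS),
\]

with

\[
\zeta_S^{-1}(p, i, P, M) \equiv \xi_{1,S}^{-1}(p, i, P, M).
\]

For $S = 1$, a block-scatter distribution results, while for
$S \ge S_{crit} \equiv \cdots + 1$, the generalized block-linear
distribution function is recovered. See also [8].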
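The block-linear and two-parameter definitions, as we read them, can be exercised with a short Python sketch that checks that each map is a bijection onto valid local names. All function names are hypothetical. The cardinality function below is our own derivation (full blocks minus the shortfall of a foreshortened last block) and should be treated as an assumption, not as the formula of Definition 2.

```python
def theta(k, t):
    """Theta^k(t): 0 for t < 0; t for t >= 0, k > 0; 1 for t >= 0, k == 0."""
    if t < 0:
        return 0
    return t if k > 0 else 1


def block_params(P, M, B):
    b = -(-M // B)  # number of blocks, ceil(M / B)
    if b < P:
        raise ValueError("the definitions require b >= P")
    l, r = divmod(b, P)
    return b, l, r


def block_linear(I, P, M, B=1):
    b, l, r = block_params(P, M, B)
    rev = b - 1 - I // B  # block index counted from the last block
    p = P - 1 - max(rev // (l + 1), (rev - r) // l)
    return p, I - B * (p * l + theta(1, p - (P - r)))


def block_linear_inv(p, i, P, M, B=1):
    b, l, r = block_params(P, M, B)
    return i + B * (p * l + theta(1, p - (P - r)))


def block_linear_card(p, P, M, B=1):
    # Assumption: our derivation, not the source's expression.
    b, l, r = block_params(P, M, B)
    shortfall = B * b - M if p == P - 1 else 0
    return B * (l + theta(0, p - (P - r))) - shortfall


def two_parameter(I, P, M, B=1, S=1):
    b, l, r = block_params(P, M, B)
    l_S = l // S
    p0, i0 = block_linear(I, P, M, B)
    lam0 = i0 // (B * S)
    if lam0 >= l_S:
        return p0, i0
    I_BS = p0 * l_S + lam0
    return I_BS % P, B * S * (I_BS // P) + i0 % (B * S)


def two_parameter_inv(p, i, P, M, B=1, S=1):
    b, l, r = block_params(P, M, B)
    l_S = l // S
    lam = i // (B * S)
    if lam < l_S:
        I_BS = p + lam * P
        p, i = I_BS // l_S, B * S * (I_BS % l_S) + i % (B * S)
    return block_linear_inv(p, i, P, M, B)


def check(P, M, B, S):
    """True if the two-parameter map is one-to-one with a working inverse."""
    seen = set()
    for I in range(M):
        p, i = two_parameter(I, P, M, B, S)
        assert 0 <= p < P and 0 <= i < block_linear_card(p, P, M, B)
        assert two_parameter_inv(p, i, P, M, B, S) == I
        seen.add((p, i))
    return len(seen) == M
```

With P = 4, M = 11 and B = 2 (the row distribution quoted for Figure 3), this sketch assigns 2, 2, 4 and 3 coefficients to the four processes.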
Definition 4 (Data Distributions)

Given a data-distribution function family
$(\mu, \mu^{-1}, \mu^{\sharp})$ ($(\nu, \nu^{-1}, \nu^{\sharp})$), a
process list of $P$ ($Q$), $M$ ($N$) as the number of coefficients,
and a row (respectively, column) orientation, a row (column) data
distribution $\mathcal{G}^{row}$ ($\mathcal{G}^{col}$) is defined as:

\[ \mathcal{G}^{row} \equiv \{ (\mu, \mu^{-1}, \mu^{\sharp});\, P, M \}, \]

respectively,

\[ \mathcal{G}^{col} \equiv \{ (\nu, \nu^{-1}, \nu^{\sharp});\, Q, N \}. \]

A two-dimensional data distribution may be identified as consisting
of a row and column distribution defined over a two-dimensional
process grid of $P \times Q$ processes, as
$\mathcal{G} \equiv (\mathcal{G}^{row}, \mathcal{G}^{col})$.

Further discussion and detailed comparisons on data-distribution
functions are offered in [8]. Figure 3 illustrates the effects of
linear and scatter data-distribution functions on a small rectangular
array of coefficients.

Performance  vs.  Scattering 

Consider a fixed logical process grid of R processes, with
P × Q = R. For the sake of argument, assume partial row pivoting
during LU factorization for the retention of numerical stability.
Then, for the LU factorization, it is well known that a scatter
distribution is “good” for the matrix rows, and optimal were there no
off-diagonal pivots chosen. Furthermore, the optimal column
distribution is also scatter, because columns are chosen in order for
partial row pivoting. Compatibly, a scatter distribution of matrix
rows is also “good” for the triangular solves. However, for
triangular solves, the best column distribution is linear, because
this implies less intercolumn communication, as we detail below. In
short, the optimal configurations conflict, and because explicit
redistribution is expensive, a static compromise must be chosen. We
address this need to compromise through the one-parameter
distribution function ζ described in the previous section, offering a
variable degree of scattering via the S-parameter. To first order,
changing S does not affect the cost of computing the Jacobian
(assuming column-wise finite-difference computation), because each
process column works independently.

It's important to note that triangular solves derive no benefit from
Q > 1. The standard column-oriented solve keeps one process column
active at any given time. For any column distribution, the updated
right-hand-side vectors are retransmitted W times (process column to
process column) during the triangular solve, whenever the active
process column changes. There are at least $W_{min} = Q - 1$ such
transmissions (linear distribution), and at most $W_{max} = N - 1$
transmissions (scatter distribution). The complexity of this
retransmission is $O(WN/P)$, representing quadratic work in N for
$W \approx N$.

Calculation complexity for a sparse triangular solve is proportional
to the number of elements in the triangular matrix, with a low
leading coefficient. Often, there are $O(N^{1+x})$ elements, with
$x < 1$, in the triangular matrices, including fill. This operation
is then $O(N^{1+x}/P)$, which is less than quadratic in N.
Consequently, for large W, the retransmission step is likely of
greater cost than the original calculation. This retransmission
effect constrains the amount of scattering and size of Q in order to
have any chance of concurrent speedup in the triangular solves.

Using the one-parameter distribution with $S > 1$ implies that
$W \approx N/S$, so that the retransmission complexity is
$O(N^2/(SP))$. Consequently, we can bound the amount of
retransmission work by picking S sufficiently large. Clearly,
$S = S_{crit}$ is a hard upper bound, because we reach the linear
distribution limit at that value of the parameter. We suggest picking
$S \approx 10$ as a first guess, and $S \approx \sqrt{N}$, more
optimistically. The former choice basically reduces retransmission
effort by an order of magnitude. Both examples in the following
section illustrate the effectiveness of choosing S by these
heuristics.
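The dependence of W on S can be checked by counting how often the owning process column changes as the matrix columns are traversed in order. The Python sketch below does this for the one-parameter distribution (B = 1) as we reconstruct it from Definitions 2 and 3; the reconstruction and the function names are assumptions, and N ≥ Q is required.

```python
def one_parameter_owner(J, Q, N, S):
    """Process column owning global column J under zeta_S (B = 1)."""
    l, r = divmod(N, Q)
    rev = N - 1 - J
    q0 = Q - 1 - max(rev // (l + 1), (rev - r) // l)  # linear owner
    j0 = J - (q0 * l + max(q0 - (Q - r), 0))          # linear local name
    l_S = l // S
    lam0 = j0 // S
    if lam0 >= l_S:
        return q0  # coefficient keeps its linear placement
    return (q0 * l_S + lam0) % Q  # groups of S columns are scattered


def retransmissions(N, Q, S):
    """W: number of changes of the active process column over J = 0..N-1."""
    owners = [one_parameter_owner(J, Q, N, S) for J in range(N)]
    return sum(1 for a, b in zip(owners, owners[1:]) if a != b)
```

For N = 24 and Q = 4, the count is N − 1 = 23 for S = 1 (scatter), 11 for S = 2, and Q − 1 = 3 once S is large enough to reach the linear limit, in line with W ≈ N/S.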

The two-parameter ξ distribution can be used on the matrix rows to
trade off load balance in the factorizations and triangular solves
against the amount of (communication) effort needed to compute the
Jacobian. In particular, a greater degree of scattering can
dramatically increase the time required for a Jacobian computation
(depending heavily on the underlying equation structure and problem),
but importantly reduce load imbalance during the linear algebra
steps. The communication overhead caused by multiple process rows
suggests shifting toward smaller P and larger Q (a squatter grid), in
which case greater concurrency is attained in the Jacobian
computation, and the additional communication previously induced is
then somewhat mitigated. The one-parameter distribution used on the
matrix columns then proves effective in controlling the cost of the
triangular solves by choosing the minimally allowable amount of
column scattering.

Let's make explicit the performance objectives we consider when
tuning S, and, more generally, when tuning the grid shape P × Q = R.
In the modified Newton iteration, for instance, a Jacobian
factorization is reused until convergence slows unacceptably. An “LU
Factorization + Backsolve” step is followed by η “Forward +
Backsolves,” with η ∼ O(1) typically (and varying dynamically
throughout the calculation). Assuming an averaged η, say η* (perhaps
as large as five [3]), then our first-level performance goal is a
heuristic minimization of

\[ T_{LU} + (\eta^* + 1)\, T_{Back} + \eta^*\, T_{Forward} \]

over S for fixed P, Q. η* > 1 more heavily weights the reduction of
triangular solve costs vs. B-mode factorization than we might at
first have assumed, placing a greater potential gain on the
one-parameter distribution for higher overall performance. We
generally want heuristically to optimize

\[ T_{Jac} + T_{LU} + (\eta^* + 1)\, T_{Back} + \eta^*\, T_{Forward} \]

over S, P, Q, R. Then, the possibility of fine-tuning row and column
distributions is important, as is the use of non-power-of-two grid
shapes.

Figure 3. Example of Process-Grid Data Distribution

[Figure: the 4 × 4 arrangement of local arrays $A^{p,q}$,
p, q = 0, ..., 3.]

An 11 × 9 array with block-linear rows (B = 2) and scattered columns
on a 4 × 4 logical process grid. Local arrays are denoted at left by
$A^{p,q}$ where (p, q) is the grid position of the process on
$\mathcal{G} \equiv (\{(\lambda_2, \lambda_2^{-1}, \lambda_2^{\sharp});\, P = 4, M = 11\},
\{(\sigma_1, \sigma_1^{-1}, \sigma_1^{\sharp});\, Q = 4, N = 9\})$.
Subscripts (i.e., $a_{i,j}$) are the global (I, J) indices.

Performance 

Order  13040  Example 

We consider an order 13040 banded matrix with a bandwidth of 326
under partial row pivoting. For this example, we have compiled timing
results for a 16×12 process grid with random matrices (entries have
range 0-10,000) using different values of S on the column
distribution (see Table 1). We indicate timing for A-mode, B-mode,
Backsolves and Forward- and Backsolves together (“Solve” heading).
For this example, S = 30 saves 76% of the triangular solve cost
compared to S = 1, or approximately 186 seconds, roughly 6 seconds
above the linear optimal. Simultaneously, we incur about 17 seconds
additional cost in B-mode, while saving about 93 seconds in the
Backsolve. Assuming η* = 1 (η* = 0) in the first above-mentioned
objective function, we save about 262 (respectively, 76) seconds.
Based on this example, and other experience, we conclude that this is
a successful practical technique for improving overall sparse linear
algebra performance. The following example further bolsters this
conclusion.
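The quoted savings can be reproduced from the Table 1 entries for S = 1 and S = 30 with a few lines of Python. The times below are read from Table 1 with the powers of ten chosen so that they agree with the differences quoted in the text (186, 17 and 93 seconds); that reading, and the names `TIMES`, `objective` and `saving`, are assumptions of this sketch.

```python
# Times in seconds for the 16x12 grid, scattered rows (Table 1).
TIMES = {
    1: {"b_mode": 160.3, "back": 119.6, "solve": 242.6},
    30: {"b_mode": 176.9, "back": 26.53, "solve": 56.31},
}


def objective(t, eta):
    """First objective: T_LU + (eta + 1) * T_Back + eta * T_Forward."""
    forward = t["solve"] - t["back"]  # "Solve" is forward plus back solve
    return t["b_mode"] + (eta + 1) * t["back"] + eta * forward


def saving(eta):
    """Seconds saved by S = 30 relative to S = 1 for a given eta*."""
    return objective(TIMES[1], eta) - objective(TIMES[30], eta)
```

Here `saving(1)` and `saving(0)` evaluate to roughly 262.8 and 76.5 seconds, matching the figures of about 262 and 76 seconds given above.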

Order  2500  Example 

Now, we turn to a timing example of an order 2500 sparse, random
matrix. The matrix has a random diagonal, plus two-percent random
fill of the off-diagonals; entries have a dynamic range of 0-10,000.
Normally, data is averaged over random matrices for each grid shape
(as noted), and over four repetitive runs for each random matrix.
Partial row pivoting was used exclusively. Table 2 compiles timings
for various grid shapes of row-scatter/column-scatter, and
row-scatter/column-(S = 10) distributions, for as few as nine nodes
and as many as 128. Memory limitations set the lower bound on the
number of nodes.

This example demonstrates that speedups are possible for this
reasonably small sparse example with this general-purpose solver, and
that the one-parameter distribution is key to achieving overall
better performance even for this random, essentially unstructured
example. Without the one-parameter distribution, triangular solve
performance is poor, except in grid configurations where the
factorization is itself degraded (e.g., 2×16). Furthermore, the
choice of S = 10 is universally reasonable for the Q > 1 grid shapes
illustrated here, so the distribution proves easy to tune for this
type of matrix. We are able to maintain an almost constant speed for
the triangular solves while increasing speed for both the A-mode and
B-mode factorizations. We presume, based on experience, that
triangular solve times are comparable to the sequential solution
times; further study is needed in this area to see if and how
performance can be improved. The consistent A-mode to B-mode ratio of
approximately two is attributed primarily to reduced communication
costs in B-mode, realized through the elimination of essentially all
combine operations in B-mode.

Table 1. Order 13040 Band Matrix Performance

  Distribution:                       (time in seconds)
  Row      Column   A-Mode         B-Mode         Back-Solve     Solve
  Scatter  S=1      1.140 x 10^?   1.603 x 10^2   1.196 x 10^2   2.426 x 10^2
           S=10     1.148 x 10^?   1.696 x 10^?   3.294 x 10^1   6.912 x 10^1
           S=25     1.091 x 10^?   1.670 x 10^?   2.713 x 10^1   5.752 x 10^1
           S=30     1.095 x 10^?   1.769 x 10^2   2.653 x 10^1   5.631 x 10^1
           S=40     1.116 x 10^?   2.157 x 10^?   2.573 x 10^1   5.472 x 10^1
           S=50     1.127 x 10^?   2.157 x 10^?   2.764 x 10^1   5.743 x 10^1
           S=100    1.279 x 10^?   4.764 x 10^?   2.520 x 10^1   5.367 x 10^1
           Linear   2.247 x 10^?   1.161 x 10^?   2.333 x 10^1   4.993 x 10^1

The above timing data, for the 16×12 grid configuration with
scattered rows, indicates the importance of the one-parameter
distribution with S > 1 for balancing factorization cost vs.
triangular-solve cost. The random matrices, of order 13040, have an
upper bandwidth of 164 and a lower bandwidth of 162. “Best”
performance occurs in the range S ≈ 25...40.

While triangular-solve performance exemplifies sequentialism in the
algorithm, it should be noted that we do achieve significant overall
performance improvements between 9 nodes and 72 nodes (12×6 grid),
and that the repeatedly used B-mode factorization remains dominant
compared to the triangular solves even for 128 nodes. Consequently,
efforts aimed at further increasing performance of the B-mode
factorization (at the expense of additional A-mode work) are
interesting to consider. For the factorizations, we also expect that
we are achieving non-trivial speedups relative to one node, but we
are unable to quantify this at present because of the memory
limitations alluded to above.


Future  Work,  Conclusions 

There  are  several  classes  of  future  work  to  be  con¬ 
sidered.  First,  we  need  to  take  the  A-mode  “an¬ 
alyze”  phase  to  its  logical  completion,  by  including 
pivot-order  sorting  of  the  L/U  pointer  structures  to 
improve  performance  for  systems  that  should  demon¬ 
strate  sub-quadratic  sequential  complexity.  This  will 
require  minor  modifications  to  B-mode  (that  already 
takes  advantage  of  column-traversing  elimination),  to 
reduce  testing  for  inactive  rows  as  the  elimination  pro¬ 
gresses.  We  already  realize  optimal  computation  work 
in  the  triangular  solves,  and  we  mitigate  the  effect  of 
Q > 1 quadratic communication work using the
one-parameter distribution.

Second,  we  need  to  exploit  “timelike”  concurrency  in 
linear algebra - multiple pivots. This has been addressed
by Alaghband for shared-memory implementations
of MA28 with O(N)-complexity heuristics [1].
These  efforts  must  be  reconsidered  in  the  multicom¬ 
puter  setting  and  effective  variations  must  be  devised. 
This  approach  should  prove  an  important  source  of  ad¬ 
ditional  speedup  for  many  chemical  engineering  appli¬ 
cations,  because  of  the  tendency  towards  extreme  spar¬ 
sity,  with  mainly  band  and/or  block-diagonal  struc¬ 
ture. 

Third, we could exploit new communication strategies
and data redistribution. Within a process grid,
we could incrementally redistribute L/U by utilizing
the inherent broadcasts of L columns and U rows
to improve load balance in the triangular solves at
the expense of slightly more factorization computational
overhead and significantly more memory overhead
(nearly a factor of two). Memory overhead could
be reduced at the expense of further communication if
explicit pivoting were used concomitantly.

Table 2. Order 2500 Matrix Performance (time in seconds)

Shape  Distribution       A-Mode          B-Mode         Back-Solve      Solve           Avgs
       (Row, Column)
3x3    Scatter, Scatter   3.567 x 10^*    1.783 x 10^*   1.997 x 10^*    4.115 x 10^*    1
3x4    Scatter            3.101 x 10^*    1.303 x 10^*   2.149 x 10^*    4.452 x 10^*    1
4x3    Scatter            2.778 x 10^*    1.526 x 10^*   1.728 x 10^*    3.537 x 10^*    1
2x16   Scatter            4.500 x 10^*    3.350 x 10^*   3.175 x 10^0    1.101 x 10^*    1
12x1   Scatter            2.636 x 10^*    1.206 x 10^*   4.0188 x 10^0   8.340 x 10^0    3
16x1   Scatter            2.085 x 10^*    1.000 x 10^*   4.856 x 10^0    9.8744 x 10^0   3
8x2    Scatter            2.013 x 10^*    9.41 x 10^1    1.127 x 10^*    2.295 x 10^*    3
       S = 10             1.997 x 10^*    9.63 x 10^*    4.508 x 10^0    9.399 x 10^0    3
4x4    Scatter            2.371 x 10^*    1.056 x 10^*   1.225 x 10^*    3.549 x 10^*    3
       S = 10             2.329 x 10^*    1.104 x 10^*   4.192 x 10^0    9.406 x 10^0    3
4x6    Scatter            1.456 x 10^*    7.72 x 10^*    1.723 x 10^*    3.528 x 10^*    3
       S = 10             1.684 x 10^*    8.85 x 10^*    4.206 x 10^0    9.303 x 10^0    3
12x2   Scatter            1.490 x 10^*    6.95 x 10^*    9.08 x 10^0     1.851 x 10^*    3
       S = 10             1.425 x 10^*    6.54 x 10^*    4.557 x 10^0    9.439 x 10^0    3
12x3   Scatter            1.0429 x 10^2   5.39 x 10^*    9.34 x 10^0     1.898 x 10^*    3
       S = 10             1.0382 x 10^*   5.42 x 10^*    4.539 x 10^0    9.390 x 10^0    3
8x8    Scatter            1.154 x 10^*    6.16 x 10^*    1.1082 x 10^*   2.2906 x 10^*   3
       S = 10             1.145 x 10^*    6.64 x 10^*    4.4600 x 10^0   9.651 x 10^0    3
12x6   Scatter            6.470 x 10^1    3.527 x 10^*   9.410 x 10^0    1.9141 x 10^*   3
       S = 10             6.265 x 10^1    3.417 x 10^*   4.555 x 10^0    9.495 x 10^0    3
16x8   Scatter            7.046 x 10^1    3.879 x 10^*   8.9535 x 10^0   1.8243 x 10^*   3
       S = 10             6.70 x 10^1     3.854 x 10^*   5.239 x 10^0    1.0816 x 10^*   3

Performance as a function of grid shape and size, and S-parameter. "Best" performance is for the 12x6 grid with S = 10.

Fourth, we can develop adaptive broadcast algorithms
that track the known load imbalance in the B-mode
factorization, and shift greater communication emphasis
to nodes with less computational work remaining.
For example, the pivot column is naturally a “hot
spot” because the multiplier column (L column) must
be computed before broadcast to the awaiting process
columns. Allowing the non-pivot columns to handle
the majority of the communication could be beneficial,
even though this implies additional overall communication.
Similarly, we might apply this to
the pivot row broadcast, and especially to the pivot
process, because it must participate in two broadcast
operations.

We could utilize two process grids. When rows
(columns) of U (L) are broadcast, extra broadcasts to
a secondary process grid could reasonably be included.
The secondary process grid could work on redistributing
L/U to an efficient process grid shape and size
for triangular solves while the factorization continues
on  the  primary  grid.  This  overlapping  of  communica¬ 
tion  and  computation  could  also  be  used  to  reduce  the 
cost  of  transposing  the  solution  vector  from  column- 
distributed  to  row-distributed,  which  normally  follows 
the  triangular  solves. 

The  sparse  solver  supports  arbitrary  user-defined  piv¬ 
oting  strategies.  We  have  considered  but  not  fully 
explored issues of fill-reduction vs. minimum time;
in  particular  we  have  implemented  a  Markowitz-count 
fill-reduction  strategy  [4].  Study  of  the  usefulness  of 
partial  column  pivoting  and  other  strategies  is  also 
needed.  We  will  report  on  this  in  the  future. 

Reduced-communication pivoting and parametric distributions
can be applied immediately to concurrent
dense  solvers  with  definite  improvements  in  perfor¬ 
mance.  While  triangular  solves  remain  lower-order 
work  in  the  dense  case,  and  may  sensibly  admit  less 
tuning  in  S,  the  reduction  of  pivot  communication  is 
certain  to  improve  performance.  A  new  dense  solver 
exploiting  these  ideas  is  under  construction  at  present. 

In  closing,  we  suggest  that  the  algorithms  generat¬ 
ing  the  sequences  of  sparse  matrices  must  themselves 
be  reconsidered  in  the  concurrent  setting.  Changes 
that  introduce  multiple  right-hand  sides  could  help 
to  amortize  linear  algebra  cost  over  multiple  time¬ 
like  steps  of  the  higher-level  algorithm.  Because  of 
inevitable  load  imbalance,  idle  processor  time  is  es¬ 
sentially  free  -  algorithms  that  find  ways  to  use  this 
time by asking for more speculative (partial) solutions
appear to have merit for achieving higher performance.

Abstract:

We  present  a  method  to  parallelize  any  tridia¬ 
gonal  solver,  in  a  very  efficient  way.  The  com¬ 
munication  overhead  stays  small.  The  parallel 
algorithms  have  nearly  the  same  good  qualities 
as  their  sequential  counterparts,  with  respect 
to  vectorization,  speed  and  numerical  aspects. 
The  method  is  simple  and  independent  of  the 
sequential solver used. Applied to standard
sequential solvers, it yields some well-known
as well as many new parallel algorithms.

Introduction : 

In Numerical Mathematics there is great interest
in solvers for tridiagonal systems. When
solving differential equations, tridiagonal systems
with more than 10000 unknowns arise.
There are many sequential solvers for such
systems, like Gaussian Elimination, LU Factorization
or Cyclic Reduction, and each solver
has its own advantages.

But often the sequential solvers are not fast
enough. Parallel solvers are indispensable. It is
desirable to have a parallel counterpart for each
sequential solver. But how can all these
parallel algorithms be obtained? A simple method to
parallelize all the algorithms would be very helpful.
In the following sections we describe such a
method in general and give some results.

* Research partially funded by DFG, SFB 124

Description  of  the  method : 

Suppose that the number of processors p divides
the order n of the system. Partitioning the
original system A x = d
into p subsystems, each processor works on n/p
equations. The k-th system has the following
form:


$j = (k-1)\,n/p + 1, \qquad r = k\,n/p$
$x_0 = x_{n+1} = 0$


0-8186-2113-3/90/0000/0340$01.00 © 1990 IEEE
The subsystems are neither square nor independent.
Each system, except the first and the
last one, has two more variables than equations.
Introducing these variables as parameters, the
disturbing matrix elements vanish and for
processor k system (1) results. Solving system
(1) can also be seen as:

Solve  one  tridiagonal  linear  system  with  three 
different  right  hand  sides 

$A^k u^k = d^k, \qquad A^k y^k = f^k, \qquad A^k z^k = g^k$  (2)

and choose $x^k$ as a linear combination of the
three partial solutions

$x^k = u^k - x_{j-1}\, y^k - x_{r+1}\, z^k$  (3)

For each fixed k system (2) is square and
does not depend on other systems. Thus each
processor can solve its system (2) without any
transfer, using an arbitrary sequential solver. It
suffices to convert the system into diagonal
form and to change equation (3) slightly:


$a^k x^k = u^k - x_{j-1}\, y^k - x_{r+1}\, z^k$  (3')

where $a^k$ is the principal diagonal of
the transformed system (2).

Before computing the global solution x, the
parameters which connect the subsystems
must be determined. Taking the first and the
last equation of each system (3') (only the
last one of the first system and the first one
of the last system), the tridiagonal system (4)
of order 2p - 2 arises. The two equations of
processor k are:


$a_j^k\, x_j = u_j^k - x_{j-1}\, y_j^k - x_{r+1}\, z_j^k$

$a_r^k\, x_r = u_r^k - x_{j-1}\, y_r^k - x_{r+1}\, z_r^k$  (4)


System (4) can be solved in a sequential or
in a parallel way. In order to solve the system
in parallel, the above algorithm is used recursively,
but with half the number of processors.
For example, Gaussian Elimination could be
used as solver. Each active processor solves
a system of 4 equations with 3 right hand
sides. After (log p - 1) recursions only one
processor is active and solves the remaining
system sequentially.

Solving  system  (4)  sequentially,  data  must 
be  collected.  After  that  transfer  one  processor 
computes  the  solution  and  sends  the  results 
to  all  other  processors.  The  transfer  can  be 
executed  in  at  most  2  log  p  steps. 

The whole computation consists of three steps:
First, each processor solves system (1) independently
of all others, using an arbitrary
sequential solver. This happens without any
transfer.

Second, all processors together determine the
parameters. They only need O(log p) transfers
and a little extra computation.

Third, each processor computes the solution
as a linear combination of the three partial
solutions. This is also possible without any
transfer.
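
The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`thomas`, `dense_solve`, `partitioned_solve`) are hypothetical, the blocks are processed in a serial loop rather than on separate processors, the local solver returns the full partial solutions instead of the diagonal form (3'), and the interface system is assembled with 2p unknowns and solved by a small dense elimination instead of the recursive scheme described above. The arrays `c`, `a`, `b` hold the sub-, principal and super-diagonal, with `c[0] = b[n-1] = 0`.

```python
def thomas(c, a, b, rhs_cols):
    """Sequential tridiagonal solve for several right-hand sides (lists)."""
    n = len(a)
    a = list(a)
    cols = [list(col) for col in rhs_cols]
    for i in range(1, n):                       # forward elimination
        m = c[i] / a[i - 1]
        a[i] -= m * b[i - 1]
        for col in cols:
            col[i] -= m * col[i - 1]
    for col in cols:                            # back substitution
        col[n - 1] /= a[n - 1]
        for i in range(n - 2, -1, -1):
            col[i] = (col[i] - b[i] * col[i + 1]) / a[i]
    return cols


def dense_solve(M, q):
    """Gaussian elimination with partial pivoting (small interface system)."""
    n = len(q)
    M = [row[:] + [q[i]] for i, row in enumerate(M)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(k + 1, n):
            m = M[r][k] / M[k][k]
            for col in range(k, n + 1):
                M[r][col] -= m * M[k][col]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        s = sum(M[k][col] * x[col] for col in range(k + 1, n))
        x[k] = (M[k][n] - s) / M[k][k]
    return x


def partitioned_solve(c, a, b, d, p, local_solver=thomas):
    n = len(a)
    m = n // p                                  # assumes p divides n
    parts = []
    for k in range(p):                          # step 1: independent local solves
        j, r = k * m, (k + 1) * m - 1
        f = [0.0] * m
        f[0] = c[j]                             # coupling to x_{j-1}
        g = [0.0] * m
        g[-1] = b[r]                            # coupling to x_{r+1}
        parts.append(local_solver(c[j:r + 1], a[j:r + 1], b[j:r + 1],
                                  [d[j:r + 1], f, g]))
    # step 2: interface unknowns, ordered (first, last) per block
    M = [[float(i == jj) for jj in range(2 * p)] for i in range(2 * p)]
    q = [0.0] * (2 * p)
    for k, (u, y, z) in enumerate(parts):
        for row, i in ((2 * k, 0), (2 * k + 1, m - 1)):
            q[row] = u[i]
            if k > 0:
                M[row][2 * k - 1] += y[i]       # last unknown of block k-1
            if k < p - 1:
                M[row][2 * k + 2] += z[i]       # first unknown of block k+1
    w = dense_solve(M, q)
    # step 3: linear combination x^k = u^k - x_{j-1} y^k - x_{r+1} z^k
    x = []
    for k, (u, y, z) in enumerate(parts):
        left = w[2 * k - 1] if k > 0 else 0.0
        right = w[2 * k + 2] if k < p - 1 else 0.0
        x.extend(u[i] - left * y[i] - right * z[i] for i in range(m))
    return x
```

Because `local_solver` is a parameter, any sequential tridiagonal solver with the same interface could be substituted, which mirrors the claim that the method is independent of the solver used.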

Now it is easy to give a coarse lower bound
for the efficiency of these parallel algorithms.

OM = number of all operations whose
result depends only on the matrix
(sequential case)

OR = number of all other operations,
especially those depending on the
right hand side (sequential case)

TP = number of all operations to solve
system (4) and to transfer data
(parallel case)

$\mathrm{Efficiency} = \dfrac{\text{sequential runtime}}{p \times \text{parallel runtime}}$

$\mathrm{Eff} \ge \dfrac{OR + OM}{OM + 3\,OR + TP \cdot p}$

OM and OR are linear in n and TP is linear
in p in the worst case. If the dimension n is
much bigger than the number of processors
and the startup time, then Eff $\ge$ 1/3 (let
OM $\approx$ 3n and p $\ge$ 2; then we should have
$p^2 \ll n$ and $pS \ll n$). In reality the efficiency will be
greater, because we have not used the special
structure of both new right hand sides. So far
the analysis of our method is independent of
the sequential solver used.
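
As a quick numerical illustration of this bound, the following sketch evaluates the right-hand side of the inequality. The function name and the sample operation counts are assumptions made for the example: OM = 3n and OR = 5n correspond to the counts quoted later for LU factorization, and TP = 0 is the idealized limit in which the interface system and the transfers are negligible.

```python
def efficiency_lower_bound(om, o_r, tp, p):
    """Coarse bound Eff >= (OR + OM) / (OM + 3*OR + TP*p)."""
    return (o_r + om) / (om + 3 * o_r + tp * p)


n = 32000
# Idealized case: TP = 0, OM = 3n, OR = 5n gives 8n / 18n = 4/9.
ideal = efficiency_lower_bound(3 * n, 5 * n, 0, 8)
# Worst case for the matrix-independent part: OM = 0 gives exactly 1/3.
worst = efficiency_lower_bound(0, 5 * n, 0, 8)
```

With these counts the bound lies between 1/3 and 4/9, depending on how much of the sequential work is matrix-only.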

Applications : 

Applying the above method to Gaussian
Elimination yields the well-known Wang
decomposition [1, 2]. Applying it to Cyclic
Reduction yields a parallel algorithm presented
by GMD in March 1989 [2].

Now  we  want  to  give  two  examples  and 
compare  them  with  regard  to  efficiency  and 
parallel  runtime.  This  will  be  done  not  only 
for  solving  a  tridiagonal  system  with  one 
single  right  hand  side,  but  as  well  for  a 
system  with  several  right  hand  sides  succes¬ 
sively  fixed. 

As computational model we use p processors
interconnected by a crossbar to determine
runtimes. Floating point operations have cost
A and a transfer with n data has cost S + n/B.
S stands for startup time (time to build up a
connection between the processors) and B for
bandwidth. All other operations cost nothing.

Application to Gaussian Elimination:

The Wang decomposition is obtained when the
method is applied to Gaussian Elimination.

Gaussian Elimination consists of two phases.
During the first phase the entries below the
principal diagonal are eliminated and during
the second the remaining system is solved
by backward elimination. It is easy to see
that this can be done with 8n-7 floating point
operations. For each new right hand side,
which is fixed successively, the same number
of operations is necessary.

Together with our method it is better to decompose
the second phase of the Gaussian
Elimination into two separate parts. During the
first part of the second phase the entries above
the principal diagonal are eliminated and
during the second part the remaining system
is solved. This means that system (1) is converted
into diagonal form and equation (3')
is used to determine the parameters and the
linear combination.


Solving  system  (1)  each  processor  has  to  treat 
three  right  hand  sides.  Making  use  of  the 
special  structure  of  the  right  hand  sides  pro¬ 
cessor  k  needs  only  12  (n/p  -  1)  A  opera¬ 
tions,  as  the  following  sequence  illustrates: 


During the first phase processor k computes
the values

$a_i^k \leftarrow a_i^k - (c_i^k / a_{i-1}^k)\, b_{i-1}^k$

$d_i^k \leftarrow d_i^k - (c_i^k / a_{i-1}^k)\, d_{i-1}^k$

$f_i^k \leftarrow f_i^k - (c_i^k / a_{i-1}^k)\, f_{i-1}^k = -(c_i^k / a_{i-1}^k)\, f_{i-1}^k$

for each i = j+1 ... r and during the second
phase:

$d_i^k \leftarrow d_i^k - (b_i^k / a_{i+1}^k)\, d_{i+1}^k$

$f_i^k \leftarrow f_i^k - (b_i^k / a_{i+1}^k)\, f_{i+1}^k$

$g_i^k \leftarrow g_i^k - (b_i^k / a_{i+1}^k)\, g_{i+1}^k = -(b_i^k / a_{i+1}^k)\, g_{i+1}^k$

for each i = r-1 ... j.

Solving system (4) in parallel, each processor
therefore needs (46 A + 2S + 12/B) log p - 21 A
operations. The data flow for this step is visualized
in the following picture for p = 8.


[Figure: data flow for p = 8 (processors 0 to 7). Legend: solving system (1) (original or reduced system); computing linear combination (3'); solving system (4) in the last recursion on one processor; transfer.]

To determine solution x according to equation
(3'), each processor needs 5 (n/p - 2) floating
point operations. The first and the last value
of x have already been computed during
the previous steps. Altogether we get the
following parallel runtime:

$T_p = (17\,n/p - 43)\,A + \log p\,(46A + 12/B + 2S)$

For each new right hand side fixed successively,
the operations on the vectors f and g
are the same and can be dropped. Thus only

$(13\,n/p - 39)\,A + \log p\,(46A + 12/B + 2S)$

operations are necessary for a new right hand
side.
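
The runtime formula is easy to evaluate directly. The sketch below is illustrative only: the function names are hypothetical, the logarithm is assumed to be base 2 (the text writes only "log p"), and the default parameters A = 1, S = 4, B = 2 are the values the paper uses for its theoretical comparison. The sequential reference cost is the 8n - 7 operations quoted above for Gaussian Elimination.

```python
from math import log2


def wang_parallel_runtime(n, p, A=1.0, S=4.0, B=2.0):
    """T_p = (17 n/p - 43) A + log2(p) (46 A + 12/B + 2 S), one right-hand side."""
    return (17 * n / p - 43) * A + log2(p) * (46 * A + 12 / B + 2 * S)


def sequential_runtime(n, A=1.0):
    """Sequential Gaussian Elimination: 8n - 7 floating point operations."""
    return (8 * n - 7) * A


def efficiency(n, p, A=1.0, S=4.0, B=2.0):
    """Efficiency = sequential runtime / (p * parallel runtime)."""
    return sequential_runtime(n, A) / (p * wang_parallel_runtime(n, p, A, S, B))


# Granularity G = n/p = 2000 on p = 8 processors, i.e. n = 16000.
t8 = wang_parallel_runtime(16000, 8)
e8 = 100 * efficiency(16000, 8)
```

Under these assumptions `t8` evaluates to 34137 and `e8` to roughly 46.9 percent, so the efficiency stays a little above the 4/9 limit of the coarse bound only when the logarithmic term is small compared with 17 n/p.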

Application to LU Factorization:

LU Factorization is the quickest direct solver
for tridiagonal systems. The algorithm is well
suited for systems with more than one right
hand side, even if they are fixed successively.
First the matrix is factorized, A = L U, and
then two bidiagonal systems are solved:

$Ax = d \iff L(Ux) = d \iff Ly = d \ \text{and}\ Ux = y$

L is a lower and U an upper bidiagonal matrix.
In the sequential case 8n - 7 floating point
operations are necessary to solve a system
with one right hand side, and only 5n - 4
operations for each additional right hand side.
Therefore it is interesting to have a look at the
parallel version (this seems to be a new algorithm).
To solve system (1) means now that
each processor k factorizes $A^k$ and solves the
two bidiagonal systems for its three right hand
sides. The second bidiagonal system is only
converted into diagonal form and for the
following steps equation (3') is used. For the
factorization each processor needs 3 (n/p - 1)
floating point operations. L has only units on
its principal diagonal. Together with the sparse
structure of $f^k$ and $g^k$, this fact leads to the
amount of 9 (n/p - 1) floating point operations
to solve the factorized system (1). The second
and the third step can be copied from the
Wang decomposition. Thus the following
parallel runtime results:

$T_p = (17\,n/p - 43)\,A + \log p\,(46A + 12/B + 2S)$

For each new right hand side fixed successively,
the factorization of the matrix A and
the operations on the vectors f and g can be
dropped. Thus for a new right hand side only

$(10\,n/p - 36)\,A + \log p\,(46A + 12/B + 2S)$

operations are necessary.
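
The reason LU factorization pays off for successive right hand sides can be seen in a short sequential sketch. The function names below are hypothetical and the code assumes no pivoting is needed (for example, a diagonally dominant matrix). The factorization costs 3(n - 1) operations and is done once; each solve then costs 5n - 4 operations, matching the counts given above.

```python
def lu_factor_tridiagonal(c, a, b):
    """Factor A = L U; L is unit lower bidiagonal (entries l), U has diagonal u
    and keeps the super-diagonal b. c is the sub-diagonal, a the diagonal."""
    n = len(a)
    l = [0.0] * n
    u = [0.0] * n
    u[0] = a[0]
    for i in range(1, n):
        l[i] = c[i] / u[i - 1]
        u[i] = a[i] - l[i] * b[i - 1]
    return l, u


def lu_solve_tridiagonal(l, u, b, d):
    """Solve L y = d, then U x = y, reusing a stored factorization."""
    n = len(u)
    x = list(d)
    for i in range(1, n):                 # forward substitution, L y = d
        x[i] -= l[i] * x[i - 1]
    x[n - 1] /= u[n - 1]                  # back substitution, U x = y
    for i in range(n - 2, -1, -1):
        x[i] = (x[i] - b[i] * x[i + 1]) / u[i]
    return x
```

A typical use is to call `lu_factor_tridiagonal` once and then `lu_solve_tridiagonal` for every new right hand side.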


Comparison of the two parallel algorithms:

The algorithms are compared with respect to
runtime and efficiency. We give theoretical
results as well as measured values. For theory
the following parameters are used:

A = 1, S = 4, B = 2.

The parallel computer used is a PARSYTEC
Megaframe with PAR_C. The floating point rate
for double precision numbers is very small.
Therefore the system has the parameters:

A = 1e, S = 4e, B = 2/e, e = 10^-5 s.

The  number  of  processors  and  the  granularity 
are  significant  for  the  quality  of  the  algo¬ 
rithms. 

Granularity (G) = n/p

Considering  a  system  with  one  single  right 
hand  side: 

Theoretical results:

G \ p      8       16      32      64
2000     34137   34197   34257   34317
4000     68137   68197   68257   68317

parallel runtimes for Wang decomposition & LU factorization

G \ p      8       16      32      64
2000     46.86   46.78   46.7    46.63
4000     46.96   46.92   46.88   46.8

efficiency (%) for Wang decomposition & LU factorization

Measured results:

G
2000     45.2    44.9    44.8
4000     45.8    45.6    45.3

efficiency (%) for Wang decomposition & LU factorization

Considering a system with several right hand
sides:

$A\,x^{t+1} = B\,x^{t}, \qquad t = 0, \dots, \tau$

A, B are tridiagonal matrices.

Such problems arise, e.g., when solving PDEs
by the finite difference method [3].

Theoretical results: (G = 4000, τ = 1000)

n          32000    64000    128000
Ga        415989   831989   1663989
LU        320088   640184   1280376

sequential runtimes (unit: 1000)

p              8       16       32
par. Ga    72175    72235    72295
par. LU    60172    60232    60292

parallel runtimes (unit: 1000)

p              8       16       32
par. Ga    72.04    71.9     71.9
par. LU    66.49    66.43    66.36

efficiency (%) for τ = 1000

Measured results: (G = 4000, τ = 1000)

n          16000    32000    64000
Ga          2220     4587     9387
LU          1709     3423     6844

sequential runtimes (unit: 1 sec)

p              4        8       16
par. Ga      783      809      831
par. LU      646      650      653

parallel runtimes (unit: 1 sec)

p              4        8       16
par. Ga    70.88    70.8     70.6
par. LU    66.0     65.8     65.6

efficiency (%) for τ = 1000

Conclusion:

The method can be applied to any solver, and
thus we obtain the desired parallel version.
It is also possible to generalize the method to
systems with more than three diagonals. For
m diagonals, m - 1 additional right hand sides
are required. If m stays small the algorithms
are efficient.

Periodic tridiagonal linear systems of
equations typically arise from discretizing second
order differential equations with periodic
boundary conditions. Various vector algorithms
are employed in order to solve such systems,
namely: (i) a sweeping technique, (ii) cyclic
reduction, and (iii) LU decomposition methods.
Implementations of the above methods are carried
out on a vector extension board of an Intel iPSC/2
hypercube. Comparisons of the execution times
of the utilized methods for solving different sizes
of periodic tridiagonal systems are obtained.

1. Introduction:

Periodic banded systems of equations typically
arise from discretizing second or higher order
differential equations subjected to periodic
boundary conditions [1,2]. In recent years, there
has been a lot of research in developing
algorithms for solving banded, and in particular
tridiagonal, systems on SIMD machines [9]. In
this paper, three vector methods are employed in
order to solve periodic tridiagonal linear systems
which arise from discretizing second order
differential equations with periodic boundary
conditions, namely: (i) a sweeping technique, (ii)
cyclic odd-even reduction, and (iii) LU
decomposition methods.

Implementations of the above methods are
carried out on a vector extension board of an Intel
iPSC/2 hypercube. Comparisons of the execution
time of the utilized methods for solving different
sizes of periodic tridiagonal systems are obtained.
The structure of this paper is as follows. We
describe a test problem, the finite difference
approximation to it, and the periodic system of
equations which is a result of this approximation.
We give a description of the methods and their
vector implementations. We present the
numerical results of the implementation on the
iPSC/2 hypercube.

2. The test problem:

As an example a test problem is considered,
namely:

$-y'' + y = 2 \sin x, \qquad 0 \le x \le 2\pi$  (1)
$y(0) = y(2\pi),$
$y'(0) = y'(2\pi).$

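Eq. (1) can be discretized by a finite-difference
method. We seek an approximation $u_i$ to
$y(x_i)$, where $x_i = (i-1)h$, and $h$ is the
increment in $x$. Eq. (1) may be approximated by

$-\dfrac{u_{i+1} - 2u_i + u_{i-1}}{h^2} + u_i = 2 \sin x_i$  (2)

which can be written as

$-u_{i-1} + (2 + h^2)\,u_i - u_{i+1} = B_i$  (3)

where $B_i = 2h^2 \sin x_i$, $1 \le i \le N$. Using the
periodic boundary conditions, Eq. (3) can be
written in the matrix form as:

$\begin{pmatrix} a & -1 & & & -1 \\ -1 & a & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & a & -1 \\ -1 & & & -1 & a \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_{N-1} \\ u_N \end{pmatrix} = \begin{pmatrix} B_1 \\ B_2 \\ \vdots \\ B_{N-1} \\ B_N \end{pmatrix}$  (4)

where $a = 2 + h^2$.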

3.  Numerical  methods  for  solving  the  test 
problem. 

i.  A  Sweeping  Technique: 

Eq. (3) can be solved by a version of the
Crank-Nicolson back and forth sweep method
for the heat equation [1]. We seek an equation of
the form

$u_{i+1} = a\,u_i + b_i$  (5)

suitable for computing u explicitly by sweeping to
the right. For stability we require $|a| \le 1$.
Repeated substitution of Eq. (5) into Eq. (3) to
eliminate $u_{i+1}$ and $u_i$ in favor of $u_{i-1}$ gives

$b_i + b_{i-1}\,(a - (2 + h^2)) + u_{i-1}\,[a^2 - (2 + h^2)\,a + 1] = -B_i$  (6)

Requiring the $u_{i-1}$ term to drop out determines a
(uniquely since $|a| \le 1$) as a solution of

$a^2 - (2 + h^2)\,a + 1 = 0$  (7)

The two roots of Eq. (7) are

$a = 1 + \dfrac{h^2}{2} \pm h\,\sqrt{1 + \dfrac{h^2}{4}}.$

Requiring $|a| \le 1$, we take

$a = 1 + \dfrac{h^2}{2} - h\,\sqrt{1 + \dfrac{h^2}{4}}.$

From Eqs. (6) and (7) we get

$b_{i-1} = a\,b_i + a\,B_i$  (8)

It follows that the b's can be computed explicitly
by sweeping to the left. To obtain the u's, first
solve for the b's from Eq. (8), then use Eq. (5) to
calculate the u's. In order to calculate the b's we
use an iteration procedure. We assume

$b_{N+1}$ = constant value

to start with and then we apply the Gauss-Seidel
technique [3] (in which the improved values are
used as soon as they are computed) to calculate
the rest of the b's. The calculated value of
$b_1$ (= $b_{N+1}$) is used to start the new iteration, and
the iteration procedure is repeated until the
condition

$|b_1 - b_{N+1}| <$ tolerance

is satisfied. Then we use the above procedure by
sweeping to the right by means of Eq. (5) to
obtain the u's. This method is well suited for
serial computers but not for pipeline or vector
computers due to the recursive nature of Eqs. (5)
and (8). In this paper, the cyclic reduction
method, to be discussed later, is used to solve the
bidiagonal linear systems with a nonzero element
in the lower left-hand corner and a nonzero
element in the upper right-hand corner generated
from Eqs. (8) and (5) respectively. It is to be
noted that since the boundary conditions are
periodic, $b_1 = b_{N+1}$ and $u_1 = u_{N+1}$. The
systems of equations to be solved, given in
matrix form, are:
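
A serial sketch of the sweeping technique is given below. It is an illustration under several assumptions rather than the paper's vector implementation: $h = 2\pi/N$, 0-based periodic indexing, a starting value of zero for the wrap-around entries, and a simple repeat-until-periodic iteration in place of the cyclic reduction used in the paper for the bidiagonal systems. The function name `sweep_solve` is hypothetical.

```python
import math


def sweep_solve(N, tol=1e-13, max_iter=200):
    """Solve -u[i-1] + (2 + h^2) u[i] - u[i+1] = B[i] with periodic indices."""
    h = 2 * math.pi / N
    B = [2 * h * h * math.sin(i * h) for i in range(N)]
    # Root of a^2 - (2 + h^2) a + 1 = 0 with |a| <= 1, Eq. (7).
    a = 1 + h * h / 2 - h * math.sqrt(1 + h * h / 4)

    # Sweep to the left, Eq. (8): b[i-1] = a * (b[i] + B[i]).
    b = [0.0] * N
    b_wrap = 0.0                      # guess for b[N], which equals b[0]
    for _ in range(max_iter):
        nxt = b_wrap
        for i in range(N, 0, -1):
            nxt = a * (nxt + B[i % N])
            b[i - 1] = nxt
        if abs(b[0] - b_wrap) < tol:  # periodicity reached
            break
        b_wrap = b[0]

    # Sweep to the right, Eq. (5): u[i+1] = a * u[i] + b[i].
    u = [0.0] * N
    u_start = 0.0                     # guess for u[0], which equals u[N]
    for _ in range(max_iter):
        u[0] = u_start
        for i in range(N - 1):
            u[i + 1] = a * u[i] + b[i]
        u_wrap = a * u[N - 1] + b[N - 1]
        if abs(u_wrap - u_start) < tol:
            break
        u_start = u_wrap
    return u
```

Each pass damps the error in the wrap-around value by a factor of $a^N$, so only a handful of sweeps is needed; the recursive inner loops are exactly what prevents direct vectorization.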


ii.  Cyclic  reduction  method  [5-8] 

In this paper, the cyclic odd-even reduction
method [6] is adapted in order to solve a class of
periodic tridiagonal linear systems such as the one
given in Eq. (4). This method is related to the
cyclic reduction algorithm developed for the
numerical solution of Poisson's equation on a
rectangle by Hockney [7]. The main idea of this
algorithm is the elimination of the odd-indexed
unknowns and their eventual recovery through
back substitution, i.e., this algorithm generates a
sequence of problems $A^{(i)} X^{(i)} = b^{(i)}$ such that
$X_j^{(i+1)} = X_{2j}^{(i)}$ and then recovers the odd-indexed
unknowns of $X^{(i)}$, given $X^{(i+1)}$, through back
substitution. For convenience the system in Eq.
(4) can be written in a general form as AX = b,
where


A  = 


d,  A 

ez  dz  fz 


a 


<«-i  A-i 
««  d^ 


(11) 


X  = 


■^1  ■ 

■  ■ 

Xz 

bz 

- 

.  b  = 

- 

Xn-x 

Cx 

. 

.  . 

Suppose  A  =  («.  .d,./i)w;dv  .«! 

=  A  =  0 

.  and 

W  =  2 

Setting  Ni  =  2""**'.  we 

let 

A<‘)  = 

(e/‘\d/‘>.//‘>);«^.xC> 

II 

*«jjk 

fi(‘)  = 

(4V- 

with 

A<°>  = 

=  b. 

and 

=  OL 

Then, 

for  1  =  0, ....  m  -  1  and 

J  —  1, 

Nm, 

define 

=  d^,  . 

a(’)  =  d,('>  a<‘J . 
and  for  y  =  1, ....  A^j+j  -  1  define 

. 

rr  =  . 

xr^  =  xij>. 

=  b^  -  -f^b^, 

and 

a(.»)  =  _a(.)^p  , 


4:*’  =  ^’  . 

=4>  -pa«-4>4>,. 

4:”  =  4’  -P*p  -4^4^.. 

P  =  -P/f*’ 

It is to be noted that the reduced system has the
same form as A, and only the even-indexed
unknowns appear. The reduction is continued
until we have a block 2x2
system which is solved by Gaussian elimination.
For the backward substitution step, $X^{(i)}$ is found
by

x^j>  =  j  =  1 . 

xp  =  -/PxP 

X^lx  =  b^l,  -  eJ^l,  X^ 

=  2 . . 
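
The reduction and back substitution steps can be illustrated with a short recursive sketch for a periodic tridiagonal system $e_i x_{i-1} + d_i x_i + f_i x_{i+1} = b_i$ with indices taken modulo N. This is a generic odd-even reduction written under stated assumptions, not a transcription of the formulas above: N is assumed to be a power of two, indices are 0-based (so the unknowns with odd 0-based index are eliminated), all pivots $d_i$ are assumed nonzero, and the recursion runs down to a single unknown instead of stopping at a 2x2 block. The function name is hypothetical.

```python
def cyclic_reduction_periodic(e, d, f, b):
    """Solve e[i]*x[i-1] + d[i]*x[i] + f[i]*x[i+1] = b[i], indices mod N."""
    n = len(d)
    if n == 1:
        # Both neighbours coincide with the unknown itself.
        return [b[0] / (e[0] + d[0] + f[0])]
    e2, d2, f2, b2 = [], [], [], []
    for i in range(0, n, 2):
        lo, hi = (i - 1) % n, (i + 1) % n
        alpha = e[i] / d[lo]          # eliminates x[i-1]
        gamma = f[i] / d[hi]          # eliminates x[i+1]
        e2.append(-alpha * e[lo])
        d2.append(d[i] - alpha * f[lo] - gamma * e[hi])
        f2.append(-gamma * f[hi])
        b2.append(b[i] - alpha * b[lo] - gamma * b[hi])
    x = [0.0] * n
    # Reduced system: same periodic tridiagonal form, half the size.
    x[0::2] = cyclic_reduction_periodic(e2, d2, f2, b2)
    # Back substitution for the eliminated unknowns.
    for i in range(1, n, 2):
        x[i] = (b[i] - e[i] * x[i - 1] - f[i] * x[(i + 1) % n]) / d[i]
    return x
```

Within one level, every iteration of the two loops is independent of the others, which is the property that makes the method attractive for vector hardware.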

iii. LU decomposition [4]

In this algorithm we assume that the LU
decomposition of A exists; that is, A = LU

where


L  = 


U  = 


1 

h  1 
h  1 


/«-!  1 

Xx  i2  g,_2  im-X  1 


U)  f  X 

“2  fl 


(12) 


hx 

hz 


“•-1  /■-2  K-1 

“.-1  *»-i 

n. 


with 

u,  =  =  a. 


(12.1) 
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«.  =  di  -  t  i  —  ,i  =  2 . n  -  1  (112) 


by: 


^1 

,1=2,  ...,  A 

(12.3) 

^‘—1 »  i  —  2, ....  A  1 

(12.4) 

K-i  +  /n-i  t 

(12.5) 

1. 

(12.6) 

/•-I  .%  , 

-ft-J  .1=2 . A  -  1 

•*i 

(12.7) 

Sm-l  • 

(12.8) 

«i  =  di,  hi  =  a, 

i=2,....n-l 

ti 

Ui  =  if.  -  - — ,  i  =  2 . A  -  1 

«,_i 

1  1 

Mi  =  -  ,  1  =  1, ....  It 

Ui 

h  =  “f-i .  f  =  2, ....  n 

hf  =  if  j ,  i  2, ,...,A  1 

+  /■-! 


H-l 

U/t  ~  d^  ~  gihi 
isl 


(12.9) 


The system LUx = b is then solved by the
forward substitution Ly = b, which has the form


i\  =  P«i 

Zj  =  f  i—i  Uj,  i  =  2, ...,  A  —  1 

Si  ~  ~Si—\  ~  2,  ...,  A  —  1 


>1  =  *1. 


^■-1  ~  s»~i  (« 


y,  =  b,  -  /,  y,_, ,  i  =  2, .....  a  -  1  (13) 

H-l 

y.  =  -  L  SiVi 

i=l 

followed  by  the  back  substitution  Ux  =  y,  which 
gives 


Xn 


d,  -  i,  Si  hi 

>1  =  *1. 

Eq. (13) is solved by using the Veclib routine
dibidi on the Intel iPSC/2 and can also be solved
by the cyclic reduction method for bidiagonal
systems.


>.  =  yi  -  hi  x^,i  =  1 . A  -  1. 


Xi  =  (yi  -  fi  Xi^i)IUi  ,  i  =  A -2 . 1  (14) 

The above algorithm is well suited for serial
machines, but not for vector machines due to the
recursive nature of Eqs. (12.2), (12.7), (13), and
(14).

In this paper we consider an implementation
of the LU method which is suitable for vector and
pipeline computers. This implementation is given


H-l 

y^  =  K  -  'L  Si  yi » 

i=I 

X,  =  y«  M, 

y,  =  y,  -  hi  ,  I  =  1,...,  A  -  1 

Xi  =  X  «j  .  »  =  1,  -  1 

f  i  “  f  i  Ui  ^  i  =  1,...,A  2 

Xi  -  Xi  -  fi  i  =  A  -  2 . 1  (15) 

Eq. (15) is solved by calling the Veclib routine
dubidi and can also be solved by the cyclic
reduction algorithm for bidiagonal systems.

4.  Numerical  experiments  and  results 

All three algorithms are implemented
on the vector extension board of an Intel iPSC/2
hypercube. It is found that the sweeping
technique, using the cyclic reduction method to
solve the bidiagonal systems, is the most efficient
one, followed by the cyclic reduction and finally
by the LU decomposition. It is found that the
sweeping technique is about 1.13 times faster than
the cyclic reduction and about 2.2 times faster
than the LU decomposition. Also, it is found that
the sweeping vector method runs about 10 times
faster than its serial version, the cyclic reduction
is about 8 times faster than its serial version, and
the LU decomposition is about 3 times faster than
its serial version.
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ABSTRACT 

The Parallel Diagonal Dominant (PDD) Algorithm has been proposed for solving certain types of tridiagonal linear systems. The algorithm employs a matrix approximation. Both theoretical and experimental results have shown that the PDD algorithm is a highly efficient parallel algorithm for a variety of architectures. In this paper, the effect of this approximation is studied and a rigorous error analysis is given. Numerical results are presented.


Introduction 

The Parallel Diagonal Dominant (PDD) Algorithm has been proposed in [1] for solving tridiagonal linear systems. The algorithm is based on the divide and conquer model of parallel computation. First, a linear system of order $n$ is divided into $p$ subsystems which can be solved concurrently by $p$ processors. Then the subsolutions are modified by the solution of a conquer system of order $2(p-1)$ to obtain the solution of the original system. Under certain assumptions, the coefficient matrix of this $2(p-1)$-dimensional conquer system converges to a diagonal block matrix at least exponentially as $n/p \to \infty$. This indicates that we can take this diagonal block matrix instead if $n \gg p$. Doing so reduces the communication cost, and the algorithm almost reaches the ideal speedup of $p$ if $p$ processors are used. Both theoretical and experimental results have shown that the PDD algorithm is a highly efficient parallel algorithm for a variety of architectures. In this paper we study the accuracy of the results of the PDD algorithm, discuss how the solution of the system is affected by the matrix approximation, and give a rigorous error analysis.

Section 1 briefly describes the PDD algorithm. Section 2 derives the error bound. Computational results are presented in Section 3.


1. The parallel diagonal dominant (PDD) algorithm


The PDD algorithm is used to solve the linear system of the form

$Ax = d, \qquad (1.1)$

where

$A = (a_i, b_i, c_i) \quad (i = 0, \ldots, n-1) \qquad (1.2)$

is an $n \times n$ tridiagonal matrix satisfying certain conditions. For convenience we assume that $n = pm$. A tridiagonal matrix $A$ is called evenly diagonal dominant if

$|a_i| \le |b_i/2|, \quad |c_i| \le |b_i/2| \quad \text{and} \quad a_{i+1} c_i > 0. \qquad (1.3)$

Suppose matrix $A$ in (1.1) is evenly diagonal dominant. Scaling both sides of (1.1), without loss of generality, we assume that matrix $A = (a_i, b_i, c_i)$ satisfies

$|a_i| \le 1, \quad |c_i| \le 1, \quad b_i \ge 2, \quad \text{and} \quad a_{i+1} c_i > 0. \qquad (1.4)$

Matrix $A$ can be written as

$A = \tilde{A} + \Delta A \qquad (1.5)$

with $\tilde{A} = \mathrm{diag}(A_0, A_1, \ldots, A_{p-1})$ block diagonal and $\Delta A$ holding the remaining elements of $A$, which couple neighboring blocks. The submatrices $A_j = (a_i^{(j)}, b_i^{(j)}, c_i^{(j)})$ $(j = 0, \ldots, p-1)$ are $m \times m$ matrices. Let $e_i$ be a column vector with its $i$th $(0 \le i \le n-1)$ entry being 1 and all the other entries being zero. Then $\Delta A$ can be expressed as $\Delta A = V E^T$, with

$V = [a_m e_m,\ c_{m-1} e_{m-1},\ \ldots,\ a_{(p-1)m} e_{(p-1)m},\ c_{(p-1)m-1} e_{(p-1)m-1}],$

$E = [e_{m-1},\ e_m,\ \ldots,\ e_{(p-1)m-1},\ e_{(p-1)m}];$

both $V$ and $E$ are $n \times 2(p-1)$ matrices. Based on the matrix modification formula [2], the solution of (1.1) is

$x = A^{-1} d = (\tilde{A} + V E^T)^{-1} d = \tilde{A}^{-1} d - \tilde{A}^{-1} V (I + E^T \tilde{A}^{-1} V)^{-1} E^T \tilde{A}^{-1} d. \qquad (1.6)$

Introducing a permutation matrix $P$, the bandwidth of the matrix $I + E^T \tilde{A}^{-1} V$ can be reduced from 5 to 3. The solution of (1.1) then becomes

$x = \tilde{A}^{-1} d - \tilde{A}^{-1} V P Z^{-1} E^T \tilde{A}^{-1} d, \qquad (1.7)$

where $Z = P + E^T \tilde{A}^{-1} V P$ is a $2(p-1) \times 2(p-1)$ tridiagonal matrix. Equation (1.1) is solved in the following steps:

1. Solve $\tilde{A} \tilde{x} = d$, $\tilde{A} Y = V P$.

2. Form $h = E^T \tilde{x}$, $Z = P + E^T Y$.

3. Solve $Z y = h$.

4. Compute $\Delta x = Y y$, $x = \tilde{x} - \Delta x$.

Since matrix $\tilde{A}$ is a diagonal block matrix, step 1 can be done concurrently. It is equivalent to solving three subsystems of order $m$ with the same coefficient matrix. There is no computation at step 2. Due to the special structure of matrix $Y$, step 4 can be computed by $p$ processors simultaneously. Different ways of solving the system at step 3 result in different algorithms. Let us denote the matrix

$Z = (a^*_i, b^*_i, c^*_i) \quad (i = 0, \ldots, 2(p-1)-1) \qquad (1.8)$

and the determinants of the submatrices $A_j$ as $\det(A_j)$ $(j = 0, \ldots, p-1)$. It is proved in [1] that

$|a^*_{2j}| = \prod_{i=0}^{m-1} |a_i^{(j)}| / \det(A_j), \qquad |c^*_{2j-1}| = \prod_{i=0}^{m-1} |c_i^{(j)}| / \det(A_j) \qquad (1.9)$

for $j = 1, \ldots, p-2$ and

$\max(|a^*_{2j}|, |c^*_{2j-1}|) \le \frac{1}{(m+1)(1+\epsilon/4)^m} \qquad (1.10)$

with $\epsilon = \min_{0 \le i \le n-1} (b_i - 2)$. When $\epsilon > 0$, the inequality (1.10) shows that the off-diagonal elements $a^*_{2j}$, $c^*_{2j-1}$ $(j = 1, \ldots, p-2)$ of $Z$ converge to zero at least exponentially as $m = n/p \to \infty$. From this result, the PDD algorithm uses the block-diagonal matrix

$\tilde{Z} = \mathrm{diag}(\tilde{Z}_0, \ldots, \tilde{Z}_{p-2}), \qquad \tilde{Z}_j = \begin{pmatrix} b^*_{2j} & c^*_{2j} \\ a^*_{2j+1} & b^*_{2j+1} \end{pmatrix}, \qquad (1.11)$

instead of $Z$ when $n \gg p$ at step 3. This matrix approximation removes the bottleneck of the computation and makes the algorithm highly parallel. It also reduces the communication cost to two neighboring communications only. Both theoretical and experimental results have shown that, using $p$ processors, the PDD algorithm almost reaches a speed-up of $p$ when the matrix condition permits. For a detailed description of the PDD algorithm, the reader may refer to [1].
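The four steps above, with the block-diagonal approximation used at step 3, can be sketched in plain Python. This is only an illustrative serial sketch, not the implementation of [1]: the function names `thomas` and `pdd_solve` are hypothetical, the blocks are processed in a loop rather than on $p$ processors, and the 2x2 interface systems are written directly in terms of the last and first entries of the block solutions, which is our reading of the structure of $\tilde{Z}$.

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system without pivoting.

    a, b, c are the sub-, main and super-diagonals; a[0] and c[-1] are ignored.
    """
    n = len(b)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def pdd_solve(a, b, c, d, p):
    """Approximate solve of a tridiagonal system split into p blocks."""
    n = len(b)
    m = n // p
    assert n == p * m, "n must be a multiple of p"
    xt, v, w = [], [], []
    # Step 1: three solves per block, all with the same block matrix A_j.
    for j in range(p):
        lo, hi = j * m, (j + 1) * m
        aj, bj, cj = a[lo:hi], b[lo:hi], c[lo:hi]
        left = a[lo] if j > 0 else 0.0            # coupling to previous block
        right = c[hi - 1] if j < p - 1 else 0.0   # coupling to next block
        xt.append(thomas(aj, bj, cj, d[lo:hi]))
        v.append(thomas(aj, bj, cj, [left] + [0.0] * (m - 1)))
        w.append(thomas(aj, bj, cj, [0.0] * (m - 1) + [right]))
    # Step 3 with the approximate matrix: one independent 2x2 system per
    # interface between block j and block j+1.
    last_of_prev = [0.0] * p    # approximate last unknown of block j-1
    first_of_next = [0.0] * p   # approximate first unknown of block j+1
    for j in range(p - 1):
        wl, vf = w[j][-1], v[j + 1][0]
        r1, r2 = xt[j][-1], xt[j + 1][0]
        det = 1.0 - wl * vf
        last_of_prev[j + 1] = (r1 - wl * r2) / det
        first_of_next[j] = (r2 - vf * r1) / det
    # Step 4: each block corrects its own part of the solution.
    x = []
    for j in range(p):
        x.extend(
            xt[j][i] - last_of_prev[j] * v[j][i] - first_of_next[j] * w[j][i]
            for i in range(m)
        )
    return x
```

For a matrix such as $A = (1, 3, 1)$ with blocks of a few dozen rows, the result of `pdd_solve` should agree with a direct `thomas` solve of the full system to roughly machine accuracy, since the dropped coupling terms decay exponentially with the block size.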


2.  The  accuracy  of  the  PDD  algorithm 

The PDD algorithm uses the approximate matrix $\tilde{Z}$ instead of $Z$ at step 3. This raises the question of the accuracy of the result. It has been claimed in [1] that if $\tilde{Z}$ equals $Z$ within machine accuracy, the error between the approximate and the exact solutions would have the same order of magnitude. In this section we study the effect of the matrix approximation in detail and give a rigorous error analysis.

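For simplicity we assume that the evenly diagonal dominant matrix $A = (a_i, b_i, c_i)$ in (1.1) satisfies (1.4) with

$\epsilon = \min_{0 \le i \le n-1} (b_i - 2) > 0. \qquad (2.1)$

Define $\Delta Z = Z - \tilde{Z}$, i.e., the matrix whose only nonzero entries are the elements $a^*_{2j}$, $c^*_{2j-1}$ $(j = 1, \ldots, p-2)$ of $Z$ that are set to zero in $\tilde{Z}$. $\qquad (2.2)$

It is easy to verify that

$\|\Delta Z\| = \max_{1 \le j \le p-2} (|a^*_{2j}|, |c^*_{2j-1}|) \le \frac{1}{(m+1)(1+\epsilon/4)^m}, \quad \text{from (1.10)}. \qquad (2.3)$

Here and throughout, $\|\cdot\|$ denotes the spectral matrix norm [3, pg. 81]. Let $y + \Delta y$ be the exact solution of the system

$\tilde{Z}(y + \Delta y) = h, \qquad (2.4)$

then $\tilde{Z} \Delta y = \Delta Z\, y$ because $Z y = h$ (see step 3 of Section 1). It has been proved [1] that $\tilde{Z}$ is invertible, thus

$\|\Delta y\| \le \|\tilde{Z}^{-1}\| \, \|\Delta Z\| \, \|y\|. \qquad (2.5)$

The relative error for the solution of $Z y = h$ is bounded by $\|\tilde{Z}^{-1}\| \, \|\Delta Z\|$. An upper bound of $\|\tilde{Z}^{-1}\|$ is given by

Proposition 2.1. Let $\tilde{Z}$ be defined in (1.11). Then

$\|\tilde{Z}^{-1}\| \le \frac{3(1+\epsilon)}{\epsilon}. \qquad (2.6)$

Proof. $\tilde{Z}$ is a diagonal block matrix. The $j$th block of $\tilde{Z}$ is

$\begin{pmatrix} b^*_{2j} & c^*_{2j} \\ a^*_{2j+1} & b^*_{2j+1} \end{pmatrix} = \begin{pmatrix} b^*_{2j} & 1 \\ 1 & b^*_{2j+1} \end{pmatrix}.$

The eigenvalues of the $j$th block are

$\lambda^{(j)}_{1,2} = \frac{(b^*_{2j} + b^*_{2j+1}) \pm \sqrt{(b^*_{2j} + b^*_{2j+1})^2 + 4(1 - b^*_{2j} b^*_{2j+1})}}{2}.$

Let $\sigma(\tilde{Z})$ be the spectrum of $\tilde{Z}$, then

$\|\tilde{Z}^{-1}\| = \frac{1}{\min_{\lambda \in \sigma(\tilde{Z})} |\lambda|} = \frac{1}{\min_{0 \le j \le p-2} (|\lambda^{(j)}_1|, |\lambda^{(j)}_2|)}.$

It is sufficient to find the lower bound of $|\lambda^{(j)}_{1,2}|$. Suppose the submatrices $A_j = (a_i^{(j)}, b_i^{(j)}, c_i^{(j)})$ $(j = 0, \ldots, p-1)$ in (1.5) have the LDU factorization

$A_j = (l_i^{(j)}, 1, 0)\,(0, \delta_i^{(j)}, 0)\,(0, 1, u_i^{(j)}),$

then (see [1, Appendix])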


^m-\ 


(2.7) 


.y+i)  0+1) 
1 


*1+1  =  ( - + - + 

8<"'’ 


.0+1) 


.0+1).  0+1) 
^m-2 


0+1) 


•  •  «».-l 


-K 


0+1) 


(2.8) 


Using  mathematical  induction,  equation  (1.4)  and  the  fact 
that  ^*i-i .  ,  we  obtain 


$\delta_i^{(j)} \ge \epsilon + \frac{i+2}{i+1}. \qquad (2.9)$

It then follows that, for any $r < m$,

$\delta_0^{(j)} \delta_1^{(j)} \cdots \delta_r^{(j)} \ge (\epsilon + \tfrac{2}{1})(\epsilon + \tfrac{3}{2})(\epsilon + \tfrac{4}{3}) \cdots (\epsilon + \tfrac{r+2}{r+1}) \qquad (2.10)$

$\ge (r+2) + \epsilon M(r), \qquad (2.11)$

with $\epsilon$ defined by (2.1), where $r+2$ and $M(r)$ are the constant and the coefficient of $\epsilon$ for the polynomial on the right hand side of (2.10). $M(r)$ can be written as a finite sum


Inequalities  (2.13)-(2.16)  imply  that 


(2.17) 


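$M(r) = \frac{3}{2} \cdot \frac{4}{3} \cdots \frac{r+2}{r+1} + \frac{2}{1} \cdot \frac{4}{3} \cdot \frac{5}{4} \cdots \frac{r+2}{r+1} + \cdots = (r+2) \sum_{i=1}^{r} \frac{i}{i+1} + r + 1. \qquad (2.12)$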


Using (1.4) and (2.7)-(2.12), we obtain

$|b^*_{2j}| \le \frac{1}{\delta^{(j)}_{m-1}} \le \frac{1}{\epsilon + \frac{m+1}{m}} = \frac{m}{m\epsilon + m + 1}, \qquad (2.13)$


(/♦i)  (/+i) 

Ca  u  I 


Ihj;  .  1  <  +  + 

c(/+i)  /s:(/+lK2sy+l) 


%  (8o^  )8r 


.0+1) 


0+1)  0+1) 
C«-2 


0+1) 


/cO+l) 

(oq 


j.y+i).2j.y+i) 

8^_2  )  o„_i 

2  3 


2+eM(0)  (2+eM(0))^-3  (3+eM(l))^4 


m 

...  +  - 

(OT+£Af(OT-2))  («+!) 

1  "  i 

- +£ - - - .  (2.14) 

2+tM  (0)  ._2  (i  +eAf  (/•  -2)ni +1) 

Let $M_1$, $M_2$ be the expressions on the right hand sides of (2.13) and (2.14); we obtain

$M_1 = \frac{m}{m\epsilon + m + 1} < \frac{1}{\epsilon + 1} \qquad (2.15)$

1  "  I 


Af2= 


+1- 


2+eAf(0)  ,.2(i+eAf(i-2))^(j+l) 


(2.16) 


<7  +  1 


m 


<  1 . 


2  ,.2  )  m  +  1 


$b^*_{2j} b^*_{2j+1} < 1$ for $j = 0, \ldots, p-2$.

From (2.7), (2.8) and (1.4), $b^*_{2j}$ and $b^*_{2j+1}$ have the same sign. Without loss of generality, they are assumed to be positive. Using (2.13)-(2.14),

$\min(|\lambda^{(j)}_1|, |\lambda^{(j)}_2|) = \frac{2(1 - b^*_{2j} b^*_{2j+1})}{(b^*_{2j} + b^*_{2j+1}) + \sqrt{(b^*_{2j} + b^*_{2j+1})^2 + 4(1 - b^*_{2j} b^*_{2j+1})}}$

$\ge \frac{2(1 - M_1 M_2)}{(M_1 + M_2) + \sqrt{(M_1 + M_2)^2 + 4}} \qquad (2.18)$

$> \frac{2\left(1 - \frac{1}{1+\epsilon}\right)}{2 + \sqrt{2^2 + 4}} > \frac{\epsilon}{3(1+\epsilon)}.$

The lower bound of $|\lambda^{(j)}_{1,2}|$ then yields the inequality (2.6). □

Remark. Note that the lower bound (2.18) is much better than $\epsilon/3(1+\epsilon)$, except that the expression in (2.18) is more complex. Using (2.18), a better upper bound of $\|\tilde{Z}^{-1}\|$ is

$\|\tilde{Z}^{-1}\| \le \frac{(M_1 + M_2) + \sqrt{(M_1 + M_2)^2 + 4}}{2(1 - M_1 M_2)}, \qquad (2.19)$

where $M_1$, $M_2$ are defined by (2.13)-(2.14) and $M(r)$ is defined in (2.12).

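Suppose $n \gg p$; as we mentioned before, $\|\Delta Z\|$ can be as small as machine accuracy. Without loss of generality, we can assume that $\|\tilde{Z}^{-1}\| \, \|\Delta Z\| < 1$. Since $Z = \tilde{Z} + \Delta Z = \tilde{Z}(I + \tilde{Z}^{-1} \Delta Z)$, matrix $Z$ is invertible and

$\|Z^{-1}\| = \|(I + \tilde{Z}^{-1} \Delta Z)^{-1} \tilde{Z}^{-1}\| \le \frac{\|\tilde{Z}^{-1}\|}{1 - \|\tilde{Z}^{-1}\| \, \|\Delta Z\|}. \qquad (2.20)$

The absolute error $\|Y \Delta y\|$ for the original system $Ax = d$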




satisfies 




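$\|Y \Delta y\| = \|\tilde{A}^{-1} V P \Delta y\| \le \|\tilde{A}^{-1}\| \, \|\Delta y\| \le \|\tilde{A}^{-1}\| \, \|\tilde{Z}^{-1}\| \, \|\Delta Z\| \, \|Z^{-1}\| \, \|\tilde{A}^{-1} d\|$

by (2.5) and $y = Z^{-1} E^T \tilde{A}^{-1} d$. Combining (2.20) we obtain

$\|Y \Delta y\| \le \frac{\|\tilde{A}^{-1}\| \, \|\tilde{Z}^{-1}\|^2 \, \|\Delta Z\| \, \|\tilde{A}^{-1} d\|}{1 - \|\tilde{Z}^{-1}\| \, \|\Delta Z\|}. \qquad (2.21)$

The relative error consequently satisfies

$\frac{\|Y \Delta y\|}{\|x\|} \le \frac{\|\tilde{A}^{-1}\| \, \|\tilde{Z}^{-1}\|^2 \, \|\Delta Z\|}{1 - \|\tilde{Z}^{-1}\| \, \|\Delta Z\|} \cdot \frac{\|\tilde{A}^{-1} d\|}{\|A^{-1} d\|}. \qquad (2.22)$

Let us assume that $\|\tilde{A}^{-1} d\|$ and $\|A^{-1} d\|$ have about the same magnitude (i.e., $\|\tilde{A}^{-1} d\| / \|A^{-1} d\| = O(1)$), which usually is true in practice. Using the fact $\|\tilde{A}^{-1}\| \le 1/\epsilon$, an approximate upper bound for the relative error is

$\frac{\|\tilde{Z}^{-1}\|^2 \, \|\Delta Z\|}{\epsilon \, (1 - \|\tilde{Z}^{-1}\| \, \|\Delta Z\|)} \qquad (2.23)$

where the bounds of $\|\tilde{Z}^{-1}\|$ and $\|\Delta Z\|$ are given by Proposition 2.1 and (2.3) respectively.

The bound (2.23) indicates that if $\epsilon$ is not too small and $n \gg p$, the PDD algorithm offers a satisfactory solution.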
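As a worked illustration of how the estimates combine, the following sketch evaluates the bound (2.3) on $\|\Delta Z\|$, the bound (2.6) on $\|\tilde{Z}^{-1}\|$, and then the approximate relative error bound (2.23). The function names are hypothetical, and the squared factor in `relative_error_bound` reflects our reading of (2.21)-(2.23), in which (2.5) contributes one factor of $\|\tilde{Z}^{-1}\|$ and (2.20) the other; treat it as an assumption rather than a verified transcription.

```python
def delta_z_bound(m, eps):
    """Estimate (2.3): upper bound on the norm of Delta Z for block size m."""
    return 1.0 / ((m + 1) * (1.0 + eps / 4.0) ** m)


def z_inv_bound(eps):
    """Estimate (2.6): upper bound on the norm of the inverse of Z-tilde."""
    return 3.0 * (1.0 + eps) / eps


def relative_error_bound(m, eps):
    """Approximate bound (2.23); returns None when its assumption fails."""
    dz = delta_z_bound(m, eps)
    zi = z_inv_bound(eps)
    if zi * dz >= 1.0:
        # The analysis assumes ||Z~^-1|| ||Delta Z|| < 1; otherwise no bound.
        return None
    return zi * zi * dz / (eps * (1.0 - zi * dz))
```

With $m = 400$ the bound is tiny for $\epsilon = 1$, while for $\epsilon = 10^{-5}$ the product of the two estimates exceeds one and the function returns `None`, which mirrors the remark that $\epsilon$ must not be too small for the estimate to be informative.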


Table 1

Columns: $\epsilon$ | $\|\Delta Z\|$ computed | $\|\Delta Z\|$ estimated | relative error computed | relative error estimated

IO-» 

1.9x10'" 

i5xio-" 

HibclO*® 

S.lxio’ 

nr* 

3.6x10'“ 

2A<10r“ 

3.2x10*“ 

3AX10' 

KT* 

lOxlO-** 

2JxUr® 

3.1xl0r“ 

4.5xl(lP 

10-* 

7.8xl0r‘» 

9.2x10*“ 

Oh 

3.4x10"' 

itr* 

9.2x10'“ 

UxlO-" 

OA 

2.4x10*“ 

1.0 

5.5x10*'“ 

dJxlO"® 

0.0 

4.1x10**' 

n = 6400, p = 16, n/p = 400


Table 2

Columns: $\epsilon$ | $\|\Delta Z\|$ computed | $\|\Delta Z\|$ estimated | relative error computed | relative error estimated

10"* 

4.0x10"*  6.2xl(r“ 

1.9x10"“ 

7.7x10' 

lO** 

2.2x10'“  6.0x10*“ 

2Axl0*“ 

8.Kxlfll* 

io-> 

6.5x10"“  4.2x10*“ 

0.0 

8.4xl0*“ 

io-» 

6.3x10"’'  l.lxlir® 

0.0 

4.3x10"® 

10*' 

6.9x10"’*  43x10"*' 

0.0 

8.1x10"'* 

1.0 

OA  5.5x10*'** 

OA 

5.3x10"'** 

n = 6400, p = 4, n/p = 1600


3.  Numerical  Results 

The experiments were conducted for the system $A x = d$ with $A = (1, 2+\epsilon, 1)$ and $d = (1, 1, \ldots, 1)^T$. The computations were done in double precision. The numerical results and our estimates are listed in Table 1 and Table 2. We use (2.3) as the estimated $\|\Delta Z\|$ and (2.19), (2.23) as the estimated relative error for the solution of the system.

Both the theoretical analysis and the numerical results indicate that the real computational errors can be much smaller than our estimate. Also, as mentioned in [1], the assumption on the matrices is sufficient but not necessary. The algorithm can be applied to any positive definite or diagonal dominant matrix whose resulting elements $a^*_{2j}$, $c^*_{2j-1}$ $(j = 1, \ldots, p-2)$ underflow and for which $\|\tilde{A}^{-1}\| \, \|\tilde{Z}^{-1}\|$ is not too large. For special matrices, using similar techniques, we may be able to find a better error bound.

Abstract

This paper discusses a parallel FFT algorithm and its implementation on the iPSC/2 hypercube. Numerical experiments and performance model analysis show that the parallel FFT algorithm can be implemented on hypercubes efficiently. The internode communication cost is minimised by eliminating fragmentary message passing, overlapping communication and computation, and mapping all data groups that need mutual communication into the closest neighbors on a hypercube.

Introduction 

FFT is one of the most widely used numerical methods in science and engineering, especially in the areas of signal and image processing, time series and spectral analysis, computational mechanics and the numerical solution of partial differential equations (such as solving Poisson's equation by a fast direct method) [9]. For the sequential FFT algorithm, there have been many developments and improvements during the past twenty years since FFT was first introduced by Cooley and Tukey (1965); see [2], [5], [6], [7] for details. Along with the development of vector computers, the FFT algorithm has also been modified and implemented on computers with vector processing facilities; see [1], [10], [13]-[17]. Theoretical analysis and numerical experiments show that FFT is highly amenable to vector or SIMD architectures and significant improvements in performance have been achieved on this class of computers.

Recently, the problem of implementing the FFT algorithm on multiprocessor parallel computers with shared memory (MIMD machines) was investigated by Briggs et al. (1987), and mathematical performance models were established to quantify the analysis of the algorithm performance. They demonstrated, through numerical experiments and theoretical analysis, that the implementation of the FFT algorithm on shared memory parallel computers has been quite successful.

While discussing general problem solving strategies on parallel computers with distributed memory, Fox (1988) and Walker (1988) studied a parallel FFT algorithm on a hypercube as an example. The algorithm is based on a group of general purpose message passing routines for internode communications. Once the communication routines are implemented on a target machine, the algorithm can be ported to the new machine without difficulty.

The purpose of this paper is to investigate the implementation of the FFT algorithm on the iPSC/2 hypercube and present results of numerical experiments and performance model analyses. Both computational results and theoretical analyses show that the communication cost can be minimized by eliminating fragmentary message passing, overlapping communication and computation, and mapping all data groups that need mutual communications into the closest neighbors on a hypercube.

Figure 1: Flow chart of FFT

* The author thanks the Cornell Theory Center for providing access to its iPSC/2 hypercube.


The Algorithm:

There are many variants of the FFT algorithm in the references. We started by introducing parallelism in the algorithm discussed in [4]. To illustrate the parallel algorithm, we use the following example.

For a given array of complex elements $X_0(k)$, $k = 0, \ldots, 15$, the final result of its FFT, denoted by $X_4(k)$, can be calculated in four steps as shown in Fig. 1, where $w^l = \exp(2\pi i l/16)$, $i = \sqrt{-1}$. Each column in Fig. 1 represents the array of intermediate results during the computation. Every point (except points in the first column) is an element in the array of intermediate results and is entered by two solid lines representing transmission paths from previous points. A solid line transmits or brings a value from a point in the previous step, multiplies the value by $w^l$, and inputs the result into the point in the next array. The factor $w^l$ appears near the arrowhead; absence of this factor means $w^l = 1$. Results entering a node from the two transmission paths are combined additively. It is clear that in the array of each step, points can be classified into pairs; the two points in each pair have input transmission paths stemming from the same pair of points in the previous array. We call the two points in such a pair a dual point pair. Each dual point pair in $X_i(k)$, $i = 1, \ldots, 4$, is calculated by using a corresponding dual pair in $X_{i-1}(k)$ and multiplication by a complex exponentiation. For example:

$X_1(0) = X_0(0) + w^0 X_0(8)$

$X_1(8) = X_0(0) + w^8 X_0(8) = X_0(0) - w^0 X_0(8) \qquad (1)$

Other pairs can be obtained in a similar manner. The index difference between points in a dual pair (originally 8) is halved in every step until it reaches 1 in the last step, which means the dual pair now contains adjacent elements. In each intermediate step, the computations for different dual pairs are independent and can be done in parallel. This is the point where the parallelism is introduced. Similarly, elements in the array $X_i$ can also be cut into segments and the computations for different dual segment pairs can be done in parallel. For example, $X_1(0:3)$ and $X_1(8:11)$ form a dual segment pair, while $X_1(4:7)$ and $X_1(12:15)$ form another pair.
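The dual point pair scheme can be sketched serially in Python. This is an illustrative sketch only: the function names are hypothetical, and since Fig. 1 is not reproduced here, the rule used for the exponent of $w$ (bit reversal of the shifted index, with the final array read out in bit-reversed order) is the usual rule for this kind of flow graph and is an assumption, chosen so that it reproduces formula (1) for the pair $(0, 8)$. The sign convention $w = \exp(2\pi i/N)$ follows the text as printed.

```python
import cmath


def bit_reverse(k, bits):
    """Reverse the lowest `bits` bits of the integer k."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r


def fft_dual_pairs(x):
    """Transform x (length N = 2**mu) step by step using dual point pairs."""
    n = len(x)
    mu = n.bit_length() - 1
    assert n == 1 << mu, "length must be a power of two"
    w = cmath.exp(2j * cmath.pi / n)
    cur = list(x)
    for step in range(1, mu + 1):
        dist = n >> step            # index difference inside a dual pair
        nxt = cur[:]
        for k in range(n):
            if k & dist:
                continue            # second point of a pair, done with the first
            # Assumed exponent rule: bit-reverse the index shifted by mu - step.
            t = w ** bit_reverse(k >> (mu - step), mu) * cur[k + dist]
            nxt[k] = cur[k] + t
            nxt[k + dist] = cur[k] - t
        cur = nxt
    # The last array holds the transform in bit-reversed order.
    return [cur[bit_reverse(k, mu)] for k in range(n)]
```

Within one step every iteration of the inner loop touches a different dual pair, so the pairs (or whole segments of them) could be handed to different nodes, which is the parallelism described above.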

In general, if there are $N$ elements in $X_0(k)$ (for simplicity, we consider the case $N = 2^\mu$, where $\mu$ is an integer), we need $\mu$ steps to finish the computation. In each step, the array $X_i(k)$ is cut into segments and the computations for different dual segment pairs can be done in parallel. From Fig. 1, it is clear that to obtain a single dual pair in the final array $X_4(0:15)$, say $X_4(0)$ and $X_4(1)$, one single dual pair in $X_3$, i.e. $X_3(0)$ and $X_3(1)$, is needed for the computation, which in turn requires another two dual pairs from $X_2$, i.e. $X_2(0)$, $X_2(1)$, $X_2(2)$ and $X_2(3)$. If $X_0(0:15)$ is initially distributed over different nodes on a hypercube, the subsequent computations on a node depend on intermediate results of other nodes. It is this data dependence that brings up the need for inter-processor communication.

Suppose there are $p = 2^{\mu_1}$ nodes on a hypercube available for the computation. A commonly used way to distribute the data array $X_0$ over the nodes is to divide $X_0$ sequentially into $p$ segments of equal length $n$ and then assign one segment to each node. In each computing step that requires communication, a data segment of length $n$ must be exchanged between nodes for the next step of the computation. To reduce the amount of data transferred during the communication, we divide $X_0$ sequentially into $2p$ segments of equal length which form $p$ dual segment pairs initially, then assign one dual segment pair to each node to start the computation. In each intermediate step that needs communication, every node sends half of its computed result (length $n/2$) to the node which needs it for the next step of the computation, and waits for data of the same length to arrive from another node before continuing the computation.

Fig. 2 shows the information flow during the computation in the case of a hypercube with 4 nodes. Note that in each step that needs communication, only half of the data stored in a node is exchanged with another node. $X_0$ is divided into 8 segments $X_0^i$, $i = 0, \ldots, 7$. The circles in Fig. 2 represent nodes; the letters outside the circles indicate the data segments stored in the local memory. At the first step, $X_0^0$ and $X_0^4$ are in the same dual segment pair, so are $X_0^1$ and $X_0^5$, and so on. Fig. 2a shows the initial data distribution of the $X_0^i$'s. Each node starts work on its dual segment pair to obtain the updated values. Upon finishing the step 1 computation, node 1 needs a segment of $X_1$ from node 3 to continue the computation, while node 3 needs a segment of $X_1$ from node 1 for the next step. Thus, node 1 exchanges information with node 3, as shown by the arc in Fig. 2b, and so does node 2 with node 4. Fig. 2c shows the information flow at the last step. After that step, the computation in each node is completely independent because the distance between the points in a dual point pair is less than or equal to the data segment length, so all nodes have the necessary information available in their local memory for the continuing computation. In general, if there are $p = 2^{\mu_1}$ nodes on the hypercube and $N = 2^\mu$ elements in $X_0$ (suppose $\mu > \mu_1$), $\mu$ steps of computation are needed for the FFT and inter-processor communication is necessary only in the first $\mu_1$ steps. There is no communication at all for the remaining $\mu - \mu_1$ steps. Fig. 2d shows that if data segments were distributed into the nodes in another way, communication between non-adjacent nodes would occur (node 1 must communicate with node 4 and node 2 must communicate with node 3). To minimize communication cost and improve the algorithm performance, the following two points were considered when implementing the algorithm:

1. Eliminating fragmentary message passing and overlapping communication with computations. As was discussed previously, in the first $\mu_1$ computation steps, each node needs to exchange half of its computed results with another node. If every single data element is exchanged one by one with the progress of the computation, the communication cost will be overwhelmingly high due to the accumulation of start-up cost. To overcome this difficulty, each node sends all data items to be exchanged in one package after the computation in the current step has completed. While waiting for the data package from the other node to arrive, the node can start computing the complex exponentiations $w^k = \exp(2\pi i k/N)$ (including determining $k$ by modular arithmetic) for the next step. When the data package from the other node arrives, these exponentiations can be used immediately in the multiplications and additions as described by formula (1).

2. Mapping all dual segment pairs that need mutual communication into the closest neighbors on a hypercube. The topological connection on a hypercube with $p = 2^{\mu_1}$ nodes makes each node connected directly with $\mu_1$ other nodes (the closest neighbors). It is clear from the previous analysis that a specific node holding a dual segment pair needs to exchange intermediate results with $\mu_1$ other dual segment pairs. These dual segment pairs are mapped into the $\mu_1$ closest neighbors of that specific node, thus all pairwise communications between dual segment pairs can be done directly without transferring through intermediate nodes. Fig. 3 gives the mapping of dual segment pairs into a hypercube with eight nodes.

Figure 3: Data distribution
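The nearest-neighbor property can be checked with a small sketch. It assumes, hypothetically, that nodes carry 0-based binary labels $0, \ldots, p-1$ and that node $j$ initially holds segments $j$ and $j+p$ of the $2p$ segments; under that assumption the partner after communication step $k$ differs from the node in exactly one bit. The function names are illustrative and not taken from the paper.

```python
def exchange_partners(node, mu1):
    """Partners of `node` after steps 1..mu1, for p = 2**mu1 nodes (0-based)."""
    p = 1 << mu1
    # After step k the dual segment distance is p >> k segments, so the
    # missing segment lives on the node whose label differs in that bit.
    return [node ^ (p >> k) for k in range(1, mu1 + 1)]


def hamming_distance(u, v):
    """Number of differing bits, i.e. the hop count on a hypercube."""
    return bin(u ^ v).count("1")
```

For four nodes, `exchange_partners(0, 2)` gives `[2, 1]`; in the 1-based numbering used in the description of Fig. 2 this is node 1 exchanging first with node 3 and then with node 2, and every partner is one hop away.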

The algorithm for each node can be described as follows:

{ get $\mu$ and $\mu_1$ from the host }
{ get initial dual segment pair $X_0^i$ and $X_0^j$ from the host }
for k = 1 to $\mu_1$ do
    { compute $X_k^i$ and $X_k^j$ from $X_{k-1}^i$ and $X_{k-1}^j$ }
    { find next node $n_p$ to communicate with }
    { send out $X_k^i$ or $X_k^j$ }
    { compute complex exponentiations while waiting for data from $n_p$ to arrive }
end for
for k = $\mu_1 + 1$ to $\mu$ do
    { compute $X_k^i$, $X_k^j$ from $X_{k-1}^i$, $X_{k-1}^j$ }
end for
{ send result $X_\mu^i$, $X_\mu^j$ back to the host }

The  programming  language  is  Fortran  and  the 
communication  subroutines  are  provided  by  the 
operating  system. 

Numerical  Experiment  and  Analysis: 

The algorithm was implemented on a 32-node iPSC/2 hypercube. Extensive numerical computations have been carried out to study the performance of the algorithm. The cube can be partitioned into subcubes. Both the number of data points $N$ and the number of nodes $p$ were adjusted as parameters during the computational experiment, with $N$ ranging from 1024 to 262144 ($\mu$ from 10 to 18) and $p$ ranging from 1 to 32 ($\mu_1$ from 0 to 5). The algorithm has also been incorporated into a fast PDE solver to solve large scale reservoir simulation problems.

Fig. 4 gives the timing curves obtained from numerical computations for different $\mu$ values, where $N = 2^\mu$ is the number of data elements to be transformed and $p = 2^{\mu_1}$ is the number of nodes on the hypercube. The curves show that the computing time decreases rapidly as the number of nodes increases, until the optimal number of nodes has been reached. For example, when $N = 2^{13} = 8192$, the optimal number of nodes is $p = 2^4 = 16$. After that, the extra communication cost incurred by increasing the number of nodes will outweigh the gain from the reduced numerical computation time, so the total execution time will increase. The optimal number of nodes increases with $N$, as indicated in Fig. 4.

For theoretical analysis, we used

$t = \alpha + \beta r$

as communication model,

$T_s = 2 f N \log_2 N$

as serial running time model on one node, and

$T_p = 2 f (\log_2 N) N/p + 10 p (\alpha + \beta N/p) + (\alpha + \beta N/p) \log_2 p$

as parallel running time model on $p$ nodes. $\alpha$ is the start-up time for communication, $\beta$ is the time required to send one word to the closest neighbors and $f$ is the time needed for one floating point multiplication. The values of these parameters are the same as used in [8]. There are three terms in the parallel running time model. The first term represents the numerical computations distributed to each node, which is a decreasing function of $p$. The second term reflects the communication time for loading the program, distributing data to and collecting the final results from all the nodes; this term increases linearly with the number of nodes $p$ since broadcasting cannot be used to distribute different data segments to all nodes. The last term gives the time spent on internode communication during the computation. This term is usually much smaller than the second term since the communication takes place only between the closest neighbors in the first $\log_2 p$ steps. The timing curves given by the mathematical performance models are very close to the results obtained from the numerical computation. Those curves have been omitted to save space.

Figure 5: Efficiency curves

The optimal number of nodes $p_{opt}$ as a function of $N$ can be obtained by minimizing $T_p$. If the last communication term is omitted, the approximate solution will be:

Pop.  =  y/NlogjN/? 

which indicates $p_{opt} \propto \sqrt{N \log_2 N}$.
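The performance model can be explored numerically with a short sketch. The function names are hypothetical, and the parameter values for $f$, $\alpha$ and $\beta$ are not given in this text (they are taken from [8] in the paper), so any numbers passed in are placeholders. The closed form in `p_opt_without_exchange` is our own derivation from the model as printed: dropping the last term and setting the derivative of $2 f N \log_2 N / p + 10 p \alpha + 10 \beta N$ with respect to $p$ to zero gives $p = \sqrt{f N \log_2 N / (5\alpha)}$.

```python
import math


def t_serial(N, f):
    """Serial running time model T_s = 2 f N log2(N)."""
    return 2.0 * f * N * math.log2(N)


def t_parallel(N, p, f, alpha, beta):
    """Parallel running time model T_p on p nodes (three terms)."""
    compute = 2.0 * f * math.log2(N) * N / p
    host_io = 10.0 * p * (alpha + beta * N / p)
    exchange = (alpha + beta * N / p) * math.log2(p)
    return compute + host_io + exchange


def efficiency(N, p, f, alpha, beta):
    """Efficiency e = T_s / (p * T_p)."""
    return t_serial(N, f) / (p * t_parallel(N, p, f, alpha, beta))


def p_opt_without_exchange(N, f, alpha):
    """Minimizer of T_p when the last (exchange) term is omitted."""
    return math.sqrt(f * N * math.log2(N) / (5.0 * alpha))
```

Because the resulting optimum grows like $\sqrt{N \log_2 N}$, larger transforms can use more nodes profitably, which is consistent with the trend reported for Fig. 4.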

Fig. 5 gives the algorithm efficiency curves given by the performance models. Different curves correspond to different numbers of nodes in the computation. The algorithm efficiency is defined as:

$e = T_s / (T_p \, p)$

Ideally, if there were no overhead and communication, $e$ should be unity, meaning the computation will finish in $1/p$ of the serial running time. The figure shows that there is a certain range of $\mu$ ($N = 2^\mu$ is the number of data elements to be transformed) in which the efficiency increases very fast as $\mu$ increases. The algorithm is switching quickly from communication dominance to numerical computation dominance in this range. For a 32 node iPSC/2 hypercube, the algorithm efficiency will be over 80 percent
when  N  is  greater  than  (2*^).  In  summary,  the 
algorithm  introduced  here  is  very  efficient  when 
implemented  on  a  hypercube.  High  efficiency  can 
be  achieved  by  eliminating  fragmentary  message 
passing,  overlapping  computation  and  communi¬ 
cation  and  taking  advantage  of  rich  connections 
available  in  the  cube.  The  algorithm  can  also  be 
used  as  a  core  routine  in  the  time  series  analy¬ 
sis  or  spectral  analysis  algorithm  on  distributed 
memory  computers  to  speed  up  the  large  scale 
computation.

Abstract

An  efficient  distributed  algorithm  for 
evaluating  an  iterative  function  on  all 
pairwise  combinations  of  C  objects  on 
an  SIMD  hypercube  is  presented.  The 
algorithm  achieves  uniform  load  dis¬ 
tribution  and  minimal,  completely  lo¬ 
cal  interprocessor  communication. 

1  Introduction 

The  problem  addressed  here  is  the  following: 
Given  a  set  of  C  objects  uniformly  distributed 
among the processors of an SIMD hypercube,
and  an  operation  on  pairs  of  objects  which  may 
possibly  modify  the  objects,  is  there  a  way  to 
efficiently  evaluate  the  operation  iteratively  on 
all  the  possible  C{C  —  l)/2  pairwise  combina¬ 
tions  of  the  C  objects  in  a  distributed  fashion 
?  This  problem  arises  for  example  in  the  con¬ 
text  of  parallel  way  graph  partitioning  on  a 
hypercube  [1],  and  in  the  scheduling  of  a  round- 
robin  tournament  between  C  players  using  C/2 
courts,  where  the  paths  between  courts  form  a 
hypercube  interconnection.  Matches  between 


players  are  to  be  scheduled  so  that  the  courts 
are  maximally  utilized  and  the  players  do  min¬ 
imal  walking  between  courts. 

In an earlier study [4], a distributed solution to
the problem for an MIMD hypercube was presented,
and shown to be optimal with respect
to processor utilization and communication. In
this paper, we solve the same problem for an
SIMD hypercube. Two important constraints
in the iterative application of the function make
the otherwise trivial problem a non-trivial one:
1) the objects might get modified by the application
of the operation (i.e., they are not read-only),
and 2) the result of the current step depends on
the state of the objects after the previous step
(iterative). Since the operation can change the
objects, a consistency problem arises if multiple
copies of the same object exist simultaneously
in the distributed system. Therefore, only one
copy of an object must be allowed in the system.

The key to an efficient distributed pairwise
combining algorithm is the appropriate
scheduling of communication of the objects between
the processors so that all possible pairs
meet exactly once, and no redundant computations
occur. To achieve this, we require each
processor to communicate with only its nearest
neighbors, and do some useful work after
each communication. We present a fully
distributed algorithm which maximally utilizes
the system and uses minimal interprocessor
communication. The algorithm comprises
$p + 1$ phases, where p is the dimension
of the hypercube. Each phase consists of
two subphases: an object-circulation subphase
and a window-fragmentation subphase.
The object-circulation subphase makes
use of the SIMD data circulation algorithm
given in [2] with a simple modification to handle
variable window sizes.

The  paper  is  organized  as  follows  :  In  section 
2,  we  present  a  fully  distributed  algorithm  us¬ 
ing  only  local  inter-processor  communication 
for  solving  the  pairwise-evaluation  problem  on 
an  SIMD  hypercube.  In  section  3,  the  algo¬ 
rithm  is  shown  to  be  optimal.  Section  4  con¬ 
cludes  the  paper  with  a  brief  discussion. 

2  Distributed Pairwise-Evaluation on an SIMD Hypercube

We  use  the  following  notation  in  specifying  the 
algorithm: 

Given a processor numbered $k$, $0 \le k \le P - 1$:

$b_d(k)$ : the d-th bit of the binary representation of k
$N^{d}(k)$ : the neighbor processor whose binary
representation differs from k in only the d-th bit
$C1_k$, $C2_k$ : objects assigned to processor k
$P = 2^p$ : the number of hypercube processors
$C = 2P$ : the total number of objects


The Pairwise-Evaluation Algorithm listed below
evaluates a given function for all $C(C-1)/2$
pairwise combinations of C objects using C/2
processors. Initially, each processor $P_k$ contains
two of the C objects, labeled $C1_k$ and
$C2_k$, with no two processors containing the
same object. The processors alternate between
computation and communication, with each
processor repeatedly performing: 1) a pairwise
operation on the two locally held objects,
and 2) communication of one of the objects
to a neighbor processor, in turn receiving some
other object from a neighbor.


SIMD Distributed Pairwise-Evaluation Algorithm:

Processor $P_k$ executes:

 1.  for d ← p to 0 do
 2.    for s ← 1 to 2^d − 1 do
 3.      operate on the pair (C1_k, C2_k);
 4.      send(C2_k, N^{h(d,s)}(k));
 5.      recv(C2_k, N^{h(d,s)}(k));
 6.    endfor
 7.    operate on the pair (C1_k, C2_k);
 8.    if (d > 0) then
 9.      if (b_{d−1}(k) = 1) then
10.        send(C1_k, N^{(d−1)}(k));
11.        recv(C1_k, N^{(d−1)}(k));
12.      else
13.        send(C2_k, N^{(d−1)}(k));
14.        recv(C2_k, N^{(d−1)}(k));
15.      endif
16.    endif
17.  endfor

The key requirement is that the objects be
moved between the processors in such a way
that each possible pair of objects comes together
exactly once to enable the application
of the pairwise operation on that pair. The
algorithm has $p + 1$ phases (indexed by "d"),
where p is the number of dimensions of the hypercube.
Each phase consists of two subphases:
an object-circulation subphase where processors
circulate their objects in closed windows
(lines 2-6), and a window-fragmentation
subphase where each window subdivides into
two isolated windows (lines 8-16). The window
structure thus changes from phase to phase,
with $2^{p-d}$ independent windows of size $2^d$ being
formed during phase d, as illustrated for a
4-dimensional hypercube in Fig. 1.
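
The window structure can be tabulated directly from the processor numbers. The following Python sketch assumes, as the description of the fragmentation subphase suggests, that the processors of one phase-d window are exactly those whose addresses agree in bits d and above; the function name `windows` is illustrative and does not come from the paper.

```python
def windows(p, d):
    """Group the 2**p processor numbers into the windows of phase d.

    Assumption: two processors share a window in phase d when their
    addresses agree in all bits numbered d and higher.
    """
    groups = {}
    for k in range(2 ** p):
        groups.setdefault(k >> d, []).append(k)
    return list(groups.values())


if __name__ == "__main__":
    p = 4
    for d in range(p, -1, -1):
        w = windows(p, d)
        print(f"phase {d}: {len(w)} windows of size {len(w[0])}")
```

For a 4-dimensional hypercube this lists one window of 16 processors in phase 4, two windows of 8 in phase 3, and so on down to 16 windows of a single processor in phase 0.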

For an MIMD hypercube, object-circulation in
a window of size $2^d$ can be easily done by repeatedly
performing, $2^d - 1$ times, a circular
shift of 1 among the processors belonging to
the same window. On the other hand, due to
the central control in an SIMD hypercube, optimal
circulation requires a special exchange sequence
$X_d$ as described in [3]. This sequence is
defined recursively as follows:

$$X_1 = 0, \qquad X_d = X_{d-1},\; d-1,\; X_{d-1} \quad (d > 1)$$

For example, $X_3 = 0, 1, 0, 2, 0, 1, 0$. Using the $X_d$
sequence, object circulation in a window of
size $2^d$ is achieved by first circulating data in
windows of size $2^{d-1}$ in parallel using the $X_{d-1}$
sequence, then performing a data exchange
across the two windows (along bit $d-1$), and finally
circulating the exchanged data in the two
windows again using the $X_{d-1}$ sequence. In the
algorithm given above, the notation $h(d,i)$ is
used to denote the i-th number in the sequence
$X_d$, $1 \le i < 2^d$. As an example, $h(3,1) = 0$,
$h(3,2) = 1$, $h(3,3) = 0$, and $h(3,4) = 2$.
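
The recursive definition of the exchange sequence translates into a few lines of Python. This is a minimal sketch, not the implementation of [3]: the names `exchange_sequence`, `h` and `circulate` are illustrative, and the closed form used in `h` (the number of trailing zero bits of i) is an observation about the sequence rather than something stated in the text.

```python
def exchange_sequence(d):
    """Build X_d from X_1 = [0] and X_d = X_{d-1}, d-1, X_{d-1}."""
    seq = []
    for j in range(d):
        seq = seq + [j] + seq
    return seq


def h(d, i):
    """i-th entry (1-based) of X_d, computed as the trailing-zero count of i."""
    assert 1 <= i < 2 ** d
    return (i & -i).bit_length() - 1


def circulate(d):
    """Apply the exchanges of X_d inside one window of 2**d processors.

    Returns, for every item, the set of processors it has visited.
    """
    size = 2 ** d
    held = list(range(size))              # held[k] = item now at processor k
    visited = [{k} for k in range(size)]  # visited[item] = processors seen
    for j in exchange_sequence(d):
        # every processor swaps its item with the neighbor across bit j
        held = [held[k ^ (1 << j)] for k in range(size)]
        for k, item in enumerate(held):
            visited[item].add(k)
    return visited


if __name__ == "__main__":
    print(exchange_sequence(3))
    print(all(len(v) == 8 for v in circulate(3)))
```

In this model each of the $2^d - 1$ exchanges moves every circulating item to a processor it has not visited before, so each item passes through the whole window.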

During a phase, corresponding to one iteration
of the d-loop of the algorithm, each processor
keeps one of its objects (C1) local, while it
repeatedly receives, transforms and passes on
the second object (C2). Considering phase p,
with all processors communicating in one single
window, at the end of the $2^p - 1$ steps in the
first part (the object-circulation subphase) of
the phase, all objects constituting the various
$C1_k$'s (denoted CS1) would have been matched
up with respect to every object in the CS2
set (and the pairwise operation performed on
each such generated pair). Thus the only pairings
between objects that have not yet been
formed are mutually among the members of the CS1
set and, likewise, mutually among the members
of CS2. The window-fragmentation subphase
of phase p involves pairwise communication
exchanges between each processor and its
neighbor whose address differs in the highest
bit. During this subphase, each processor $P_k$
with highest address bit of one ($b_{p-1}(k) = 1$)
swaps its C1 object for the C2 object of its
partner processor ($P_l$, with $b_{p-1}(l) = 0$). Thus,
after this communication subphase, all processors
$P_k$ with $b_{p-1}(k) = 1$ will only have objects
from the original CS2 set, while all processors
with $b_{p-1}(k) = 0$ will have all the objects
comprising the original CS1 set. This
subphase is labeled the "window-fragmentation
subphase" because the window gets fragmented
into two smaller windows and no communication
takes place thereafter between the processors
in the "highest-bit-1" window and those
in the "highest-bit-0" window. Thus in phase
$(p - 1)$, two windows of size $2^{p-1}$ are formed
for the object-circulation subphase and communication
occurs between processors differing
in their $(p - 2)$th bit during the window-fragmentation
subphase.

Figure 1: Illustration of window formation in
different phases of the Distributed PC algorithm

Figure 2: Illustration of Distributed PC algorithm
on a 2-D hypercube (4 processors)

During each phase of the algorithm, new
object-pairs meet at the processors, for application
of the pairwise operation. The algorithm
guarantees that during an outer pass,
no pair of objects is ever matched up more
than once. Fig. 2 is used to illustrate
this "no-repetition" property of the algorithm.
In order to focus on the nature of
the window-fragmentation subphase, the effects
of the alternating object-circulation subphase
are intentionally omitted. Eight objects
are shown, mapped onto four processors,
two objects per processor. During phase
2 (d = 2), the application of the object-circulation
subphase results in the generation
of all possible pairwise combinations with one
object from CS1 ($A_{00}, A_{01}, A_{10}, A_{11}$) and the
other from CS2 ($B_{00}, B_{01}, B_{10}, B_{11}$). Ignoring
for now the actual permutation of the C2 objects
that will result at the end of the object-circulation
subphase, and assuming it to be
as shown, the window-fragmentation subphase
of phase 2 will result in the state shown for
d = 1. Processors $P_{00}$ and $P_{01}$ are left with
objects $A_{00}, A_{01}, A_{10}, A_{11}$, whereas $P_{10}$ and $P_{11}$
now have objects $B_{00}, B_{01}, B_{10}, B_{11}$. After the
window-fragmentation subphase of phase 2, $P_{0*}$
and $P_{1*}$ do not ever again communicate with
each other. Since no pairwise combinations
involving two A-objects had occurred during
phase 2, and since none of the B-objects can
any longer meet any of the A-objects, all
pairs of objects that align at any processor are
unique combinations that have not occurred
earlier. The same property clearly holds recursively,
as illustrated in the figure.

In  the  next  section,  we  formally  prove  the  cor¬ 
rectness  of  the  distributed  algorithm. 

3  Proof  of  Optimality 

Lemma 1 The total number of pairwise object
combinations that occur during execution of the
algorithm is $C(C-1)/2$.

Proof: Each processor performs one pairwise
comparison during every step of every phase of
the algorithm, as is clear from the algorithm
specification. The number of steps in phase
d is $2^d$. Hence the total number of pairwise
combinations tried is:

$$2^p \cdot \sum_{d=0}^{p} 2^d = 2^p \cdot (2^{p+1} - 1) = P(2P - 1) = C(C-1)/2$$

□

Lemma 2 Given any objects $C_i$ and $C_j$, the
combination $(C_i, C_j)$ can occur at most once
during execution of the algorithm.

Proof: Let d be the earliest phase in which the
combination $(C_i, C_j)$ occurs. Obviously, at
most one such match can occur during the
object-circulation subphase of phase d. For
such a match to occur, one of them must belong
to the C1-object-set and the other to the
C2-object-set. Since they belong to different
object-sets, during the window-fragmentation
subphase of phase d, $C_i$ and $C_j$ will necessarily
end up in processors $P_k$, $P_l$, where l and k
differ at least in bit $d - 1$, and hence $P_k$ and $P_l$
belong to different windows. Obviously, they
cannot get matched in any later phase $d' < d$.
Hence at most one match $(C_i, C_j)$ can occur
during an outer pass. □
Theorem 1 Given any two objects $C_i$ and $C_j$,
the pairwise combination $(C_i, C_j)$ occurs exactly
once during execution of the algorithm.

Proof: Theorem 1 follows immediately from
Lemma 1 and Lemma 2. By Lemma 1, a total of
$C(C-1)/2$ pairwise combinations occur, and
by Lemma 2, no combination $(C_i, C_j)$ can occur
more than once. Since the number of possible
distinct combinations of object pairs is $C(C-1)/2$,
all possible matches must occur exactly
once during execution of the algorithm. □
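
The statement of Theorem 1 can also be checked by brute force for small hypercubes. The following Python sketch is a sequential simulation written for this purpose, not the authors' SIMD code: it assumes that each send/recv pair of the algorithm acts as a simultaneous swap with the named neighbor, that $h(d,s)$ is the s-th entry of $X_d$, and it labels the initial objects ("A", k) and ("B", k) purely for illustration.

```python
from itertools import combinations


def h(d, s):
    """s-th entry (1-based) of the exchange sequence X_d."""
    return (s & -s).bit_length() - 1


def simulate(p):
    """Run the pairwise-evaluation schedule on 2**p processors.

    Returns the list of object pairs that were operated on.
    """
    P = 2 ** p
    c1 = [("A", k) for k in range(P)]   # C1_k
    c2 = [("B", k) for k in range(P)]   # C2_k
    met = []

    def operate():
        for k in range(P):
            met.append(frozenset((c1[k], c2[k])))

    for d in range(p, -1, -1):
        for s in range(1, 2 ** d):
            operate()                                    # line 3
            j = h(d, s)
            c2 = [c2[k ^ (1 << j)] for k in range(P)]    # lines 4-5
        operate()                                        # line 7
        if d > 0:                                        # lines 8-16
            bit = 1 << (d - 1)
            new_c1, new_c2 = c1[:], c2[:]
            for k in range(P):
                partner = k ^ bit
                if k & bit:
                    new_c1[k] = c2[partner]   # partner sent its C2
                else:
                    new_c2[k] = c1[partner]   # partner sent its C1
            c1, c2 = new_c1, new_c2
    return met


def check(p):
    """True if every one of the C(C-1)/2 pairs is formed exactly once."""
    met = simulate(p)
    objects = [(s, k) for s in "AB" for k in range(2 ** p)]
    expected = {frozenset(pair) for pair in combinations(objects, 2)}
    return len(met) == len(expected) and set(met) == expected


if __name__ == "__main__":
    print([check(p) for p in range(5)])
```

A successful check for a given p says that the list of operated pairs has length $C(C-1)/2$ and contains no duplicates, which is the combination of Lemma 1 and Lemma 2.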

Theorem 1 implies that with regard to computation,
the algorithm is optimal since every processor
is busy during each computational step
and no duplicate computations occur. With
respect to communication too, under the constraint
of computational load balancing and
uniform data distribution, each processor can
only contain two objects, and after performing
the pairwise operation on its currently held
pair, it will have to send out at least one object
and receive one object in order to perform
useful computation at the next step. The algorithm
causes only one object to be sent and one
object received by each processor at each step,
i.e., the algorithm performs minimal communication.

4  Discussion 

An efficient distributed algorithm for evaluating
an iterative function on all pairwise combinations
of C objects on an SIMD hypercube is
presented, and it is shown to achieve uniform
load distribution and minimal, completely local
inter-processor communication.

In the case that $C > 2P$, the algorithm can be
extended in a straightforward fashion. For
$C = MP$, $M = 2k$, $k > 1$, groups of $M/2$ objects
should be considered in place of single objects
in the presented algorithm. Now, instead
of a single pairwise operation, $(M/2)^2$ pairwise
operations are performed at each step of the algorithm
between member partitions of the two
$(M/2)$-ary object-groups in a processor. With
such an $(M/2)$-ary group of objects in place of
single objects, the algorithm for distributed PC
is essentially the same as above, except for an
additional set of operations between the components
of each $(M/2)$-ary group of objects.
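
As a consistency check on this extension, the pair count can be worked out under two assumptions that the text does not spell out: each processor holds two groups of $M/2$ objects, and the "additional set of operations" consists of the pairings inside each of the $2P$ groups, performed once. The $2P - 1$ steps per processor are those counted in Lemma 1.

```latex
\begin{align*}
\underbrace{P(2P-1)\left(\tfrac{M}{2}\right)^{2}}_{\text{between groups}}
\;+\;
\underbrace{2P\cdot\tfrac{1}{2}\cdot\tfrac{M}{2}\left(\tfrac{M}{2}-1\right)}_{\text{within groups}}
&= \frac{PM}{4}\bigl[(2P-1)M + (M-2)\bigr] \\
&= \frac{PM}{4}\,(2PM-2)
 = \frac{PM(PM-1)}{2}
 = \frac{C(C-1)}{2}.
\end{align*}
```

Under these assumptions the extended schedule again accounts for every pair of the $C = MP$ objects exactly once.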
Abstract

Reshaping of arrays is a convenient programming primitive.
For arrays encoded in a binary-reflected Gray code
reshaping implies code change. We show that an axis
splitting, or combining of two axes, requires communication
in exactly one dimension, and that for multiple axes
splittings the exchanges in the different dimensions can
be ordered arbitrarily. The number of element transfers
in sequence is independent of the number of dimensions
requiring communication for large local data sets, and
concurrent communication. The lower bound for the
number of element transfers in sequence is $K/2$ with K
elements per processor. We present algorithms that are
of this complexity for some cases, and of complexity K
in the worst case. Conversion between binary code and
binary-reflected Gray code is a special case of reshaping.

1  Introduction 

In computer systems locality of reference has had a significant
impact on performance ever since memory hierarchies
were introduced. In modern computer systems
small memories in MOS technologies may be designed
for higher speeds than larger memories. In multiprocessor
systems with processors and memory modules
interconnected via a network, the access time for non-local
information is typically considerably longer than
local access. Moreover, the access time depends upon
the network topology, congestion and bandwidth of the
communications network. The reference pattern has a
significant impact on the optimal data allocation in networks
that have a non-uniform distance between pairs
of nodes, such as Boolean cube networks.

In well structured computations the data is conveniently
represented by arrays. Many algorithms require
local references in a Cartesian space corresponding to
the array. Explicit methods for the solution of partial
differential equations are examples thereof. Preserving
the locality in the Cartesian space when mapped to the
processor network is important with respect to performance.
The binary-reflected Gray code is often used to
accomplish this task in Boolean cube networks. Successive
integers in the decimal encoding differ by one bit in
their Gray code encoding. This property is used in CM-Fortran
[1], Thinking Machines Corp.'s version of Fortran
8X [11] for the Connection Machine. In this language
implementation, array axes are by default encoded in a
binary-reflected Gray code.

Some  important  algorithms  with  a  regular  communi¬ 
cation  pattern  depend  on  local  references  in  a  Boolean 
space. For instance, the Fast Fourier Transform requires
communication in the form of a butterfly network,
which implies communication between adjacent
nodes  in  a  Boolean  space  with  corresponding  nodes  in 
different  ranks  mapped  to  the  same  processor.  In  many 
scientific  and  engineering  applications  algorithms  that 
depend  upon  both  types  of  access  patterns  may  be  used, 
and  conversion  between  the  two  storage  forms  may  be 
important. 

Many  recursive  algorithms  make  use  of  axis  split¬ 
ting,  or  combining.  An  example  is  the  data  parallel 
implementation  [2]  of  the  divide-and-conquer  algorithm 
by  Dongarra  and  Sorensen  [3]  for  computing  eigenval¬ 
ues  of  symmetric  tridiagonal  systems.  Array  manipula¬ 
tion  through  operations  such  as  RESHAPE  in  Fortran 
8X  and  APL,  impacts  the  encoding  for  binary-reflected 
Gray coded axes. The encoding of binary coded axes is
unaffected.

Different  axes  may  have  different  encoding.  For  in¬ 
stance,  if  butterfly  computations  are  performed  along 
one  axis,  and  nearest-neighbor  communications  in 
a  Cartesian  space  along  the  other  axis  of  a  two- 
dimensional  array,  then  binary  encoding  of  the  first  axis 
and  binary-reflected  Gray  code  encoding  of  the  second 
axis  is  desirable.  Furthermore,  the  encoding  of  a  sin¬ 
gle  axis  may  be  mixed.  Typically  the  number  of  array 
elements  along  an  axis  exceeds  the  number  of  proces¬ 
sors  allocated  to  the  axis,  forcing  several  elements  along 
an axis to be allocated to the memory of each processor
with the array elements being allocated as evenly as
possible. Cyclic and consecutive [6] allocation are two
common schemes for assigning multiple elements to processors.
With local random access memories distance is
not an issue in determining the encoding for the local
memories. Binary encoding is typically used for the local
part of an axis, and binary-reflected Gray code for
the processor part.

As an example consider a two-dimensional logic array
A of shape $P \times Q$ allocated to an $N_1 \times N_0$ physical
array of processors, where $P = 2^p$, $Q = 2^q$, $N_1 = 2^{n_1}$,
$N_0 = 2^{n_0}$, $p \ge n_1$ and $q \ge n_0$. The data allocation is
consecutive, and each array axis is encoded in a binary-reflected
Gray code. Bit m in the address space is denoted
$g_m$ if encoded in a binary-reflected Gray code, and
$b_m$ if encoded in binary code. Bit zero, or dimension
zero, is the least significant, and the rightmost dimension
in our expressions. The symbol || denotes concatenation
of two fields. Axes are also labeled right to left.
We illustrate the allocation as follows

$$(\underbrace{g^1_{p-1} g^1_{p-2} \cdots g^1_{p-n_1}}_{paddr^1}\;\underbrace{g^1_{p-n_1-1} g^1_{p-n_1-2} \cdots g^1_0}_{maddr^1} \;\|\; \underbrace{g^0_{q-1} g^0_{q-2} \cdots g^0_{q-n_0}}_{paddr^0}\;\underbrace{g^0_{q-n_0-1} g^0_{q-n_0-2} \cdots g^0_0}_{maddr^0})$$

The processor address for an element $(i,j)$ of the
logic array is formed as $(paddr^1(i) \| paddr^0(j))$, and the
local storage address is $(maddr^1(i) \| maddr^0(j))$, where
$G_p(i) = (g^1_{p-1} \cdots g^1_0)$ is the binary-reflected Gray
code encoding of i, and $G_q(j) = (g^0_{q-1} \cdots g^0_0)$ is the
binary-reflected Gray code encoding of j. Reshaping
the logic array into a one-dimensional array such that
$(i,j) \rightarrow iQ + j$ preserving the assignment of bits in the
logic array to bits in the physical address space implies
a code conversion for axis zero if i is odd, and data motion
within $n_0$-dimensional subcubes. The result is an
allocation of the form

$$(\hat{g}_{p+q-1} \hat{g}_{p+q-2} \cdots \hat{g}_{p+q-n_1}\; \hat{g}_{p+q-n_1-1} \hat{g}_{p+q-n_1-2} \cdots \hat{g}_q \;|\; \hat{g}_{q-1} \hat{g}_{q-2} \cdots \hat{g}_{q-n_0}\; \hat{g}_{q-n_0-1} \hat{g}_{q-n_0-2} \cdots \hat{g}_0)$$

where, as shown later, $\hat{g}_{m+q} = g^1_m$, $m \in \{0, \ldots, p-1\}$
and $\hat{g}_m = g^0_m$, $m \in \{0, \ldots, q-2\}$. The value of $\hat{g}_{q-1}$
depends upon the value of bit q of the new index, that is, on
whether i is odd. Figure 1 illustrates the
data motion.

Note that whereas the initial data allocation was consecutive
the data allocation after reshaping is not. If
a consistent data allocation is desired, i.e., the same
data allocation scheme before and after reshaping, then
it is in general necessary to change the assignment of
dimensions in the logic address space to dimensions in
the physical address space. A dimension permutation
[4,13,12,15,10,5] in the form of an $n_0$ step right cyclic
shift, or $p - n_1$ steps left cyclic shift on the dimensions
in the field $(maddr^1 \| paddr^0)$ is required, in combination
with code conversion.

Figure 1: Reshaping a 1 x 16 array to a 4 x 4 array.

With consecutive allocation of A and a binary encoding
of local addresses, and a binary-reflected Gray code
encoding of processor addresses, the processor address of
element $(i,j)$ is formed by computing the address from
the binary-reflected Gray codes of $\lfloor i/N_1 \rfloor$ and $\lfloor j/N_0 \rfloor$.
The local memory address is determined from the binary
codes of $i \bmod N_1$ and $j \bmod N_0$. The encoding of
the address field is


Reconfiguration  of  the  processor  array  is  equivalent  to 
changing  the  assignment  of  dimensions  in  the  logic  ad¬ 
dress  space  to  dimensions  in  the  physical  address  space. 
A  dimension  permutation  is  required.  If  the  encoding 
of  the  local  address  field  is  different  from  the  proces¬ 
sor  address  field,  then  a  code  conversion  is  required  in 
combination with the dimension permutation. Reconfiguration
of a processor array may be required to assure
that all operands use the same physical machine configuration,
as for instance in matrix multiplication on
the Connection Machine [8]. The Connection Machine
Fortran compiler allocates logic arrays to the processors
by defining a processor array congruent to the logic array
for each array. Hence, in the matrix multiplication
$C \leftarrow A \times B$ all three matrices may assume a different
shape of the processor array.

In this paper, we show how an axis splitting, or
the combining of two axes into one, can be performed
by a single exchange operation. For multiple axes
split/merge operations, the number of element transfers
in sequence is independent of the number of axes
created or merged, if the communication system allows
concurrent communication in all required dimensions.
The number of element transfers in sequence is only a
function of the size of the local data set, if there is a
large local data set. The minimum number of element
transfers in sequence is equal to the number of dimensions
requiring communication. The conversion between
binary-reflected Gray code and binary code is equivalent
to reshaping between a one-dimensional array and
a $2 \times 2 \times \cdots \times 2$ array of dimension n.

The algorithms we give for reshaping and code conversion
are either asymptotically optimal, or optimal within
a factor of two with respect to data transfer time. The
control information can be computed locally from the
node address. The code conversion can start in any dimension,
and the required exchanges can be carried out
in dimensions ordered arbitrarily. This property allows
reshaping by concurrent communication in all required
dimensions, if the size of the local data set exceeds the
number of dimensions requiring communication. Compared
to the algorithms in [6,7] the new algorithms avoid
the pipeline delay. Here we only treat the case with
an entire axis encoded in either binary code, or binary-reflected
Gray code. Furthermore, we assume a fixed
assignment of dimensions in the logic address space to
dimensions in the physical address space. Reshaping
combined with dimension permutations is considered in
[9].

The  paper  is  organized  as  follows.  Notation  and  def¬ 
initions  are  introduced  next.  Array  reshaping  is  dis¬ 
cussed  in  Section  3.  The  conversion  between  binary- 
reflected  Gray  code  and  binary  code  is  discussed  in  Sec¬ 
tion  4,  followed  by  summary  in  Section  5. 


2  Preliminaries 


A Boolean n-cube has $N = 2^n$ nodes. Two nodes
are adjacent if and only if their addresses differ
in exactly one bit. The binary encoding of i is
$B_n(i) = (b_{n-1} b_{n-2} \cdots b_0)$ and its binary-reflected Gray
code encoding is $G_n(i) = (g_{n-1} g_{n-2} \cdots g_0)$. $Z_M =
\{0, 1, \ldots, M-1\}$ and $(1^j)$ is a string of j instances of
a bit with value one. || is the concatenation symbol.
For the complexity estimates we assume bi-directional
channels and concurrent communication on all channels.
The number of elements per node is K. $\mathcal{G}_n$ is the
sequence of n-bit binary-reflected Gray codes for $Z_N$, i.e.,
$\mathcal{G}_n = (G_n(0), G_n(1), \ldots, G_n(2^n - 1))$.

Definition 1 [14] The binary-reflected Gray code is defined
recursively as follows.

$\mathcal{G}_1 = (G_1(0), G_1(1))$, where $G_1(0) = 0$, $G_1(1) = 1$.

$$\mathcal{G}_{n+1} = \begin{pmatrix} 0 \| G_n(0) \\ 0 \| G_n(1) \\ \vdots \\ 0 \| G_n(2^n - 2) \\ 0 \| G_n(2^n - 1) \\ 1 \| G_n(2^n - 1) \\ 1 \| G_n(2^n - 2) \\ \vdots \\ 1 \| G_n(0) \end{pmatrix}$$

In the following we always refer to the binary-reflected
Gray code defined above.

Corollary 1 The highest order bit is the same in the
binary code and the binary-reflected Gray code. The
remaining bits in the encoding of $i \in Z_{N/2}$ are defined
by $G_{n-1}((b_{n-2} b_{n-3} \cdots b_0))$. The remaining bits
in the encoding of $i \in Z_N - Z_{N/2}$ are defined by
$G_{n-1}((\bar{b}_{n-2} \bar{b}_{n-3} \cdots \bar{b}_0))$. That is,

$$G_n((b_{n-1} b_{n-2} \cdots b_0)) = \begin{cases} b_{n-1} \| G_{n-1}((b_{n-2} b_{n-3} \cdots b_0)), & \text{if } b_{n-1} = 0, \\ b_{n-1} \| G_{n-1}((\bar{b}_{n-2} \bar{b}_{n-3} \cdots \bar{b}_0)), & \text{if } b_{n-1} = 1. \end{cases}$$

Proof: From Definition 1. ∎

Corollary 2 The integer encoded in the neighbor of
node $G_n(i)$ in cube dimension j is $G_n(i \oplus (1^{j+1}))$, i.e.,
$G_n(i) \oplus 2^j = G_n(i \oplus (1^{j+1}))$.

Proof: It follows from Corollary 1. ∎

Definition 2 With binary-reflected Gray code encoding
of an N-element one-dimensional array $A[i]$, $i \in Z_N$,
into an n-cube, address $G_n(i)$ contains $A[i]$.

Lemma 1 [14] $b_m = g_{n-1} \oplus g_{n-2} \oplus \cdots \oplus g_m$, $m \in Z_n$.
Conversely, $g_m = b_m \oplus b_{m+1}$, $m \in Z_n$, with $b_n = 0$.
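
The definitions above are easy to exercise in code. The following Python sketch is illustrative only (the function names are not from the paper): `gray_sequence` follows the recursive Definition 1, `gray` uses the bitwise form $g_m = b_m \oplus b_{m+1}$ of Lemma 1, and `gray_inverse` uses the converse $b_m = g_{n-1} \oplus \cdots \oplus g_m$.

```python
def gray_sequence(n):
    """The sequence of n-bit binary-reflected Gray codes, built recursively."""
    if n == 1:
        return [0, 1]
    prev = gray_sequence(n - 1)
    top = 1 << (n - 1)
    # first half: 0 || G_{n-1}(.), second half: 1 || G_{n-1}(.) reflected
    return prev + [top | g for g in reversed(prev)]


def gray(i):
    """G_n(i) computed bitwise: g_m = b_m xor b_{m+1}."""
    return i ^ (i >> 1)


def gray_inverse(g, n):
    """Recover i from G_n(i): b_m = g_{n-1} xor ... xor g_m."""
    i, acc = 0, 0
    for m in range(n - 1, -1, -1):
        acc ^= (g >> m) & 1
        i |= acc << m
    return i


if __name__ == "__main__":
    n = 3
    for i in range(2 ** n):
        print(i, format(gray(i), f"0{n}b"))
```

The same helpers can be used to confirm the neighbor relation $G_n(i) \oplus 2^j = G_n(i \oplus (1^{j+1}))$ for small n.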
Figure 2: Reshaping between two arrays with binary-reflected
Gray code encoded on a Boolean cube.


Definition 3 Let A be an array of shape $U_{d-1} \times U_{d-2} \times
\cdots \times U_0$, $U = (U_{d-1}, U_{d-2}, \ldots, U_0)$, $U_m = 2^{u_m}$, $m \in Z_d$,
$V = (V_{d'-1}, V_{d'-2}, \ldots, V_0)$, $V_m = 2^{v_m}$, $m \in Z_{d'}$, and
$\prod_{m=0}^{d-1} U_m = \prod_{m=0}^{d'-1} V_m$. The reshape function $\rho(U, V)$
transforms the shape of the array A from U to V.

Let $\alpha_k = (\sum_{m=0}^{k} u_m) - 1$, $\beta_k = (\sum_{m=0}^{k} v_m) - 1$, $\mathcal{U} =
\{\alpha_k \mid 0 \le k < d - 1\}$, $\mathcal{V} = \{\beta_k \mid 0 \le k < d' - 1\}$ and $\mathcal{D} =
(\mathcal{U} \cup \mathcal{V}) - (\mathcal{U} \cap \mathcal{V})$. The sets $\mathcal{U}$ and $\mathcal{V}$ are the sets of most
significant dimensions for the axes of the shapes U and
V, with the most significant axes excluded. For instance,
if $U = (2^3, 2^2, 2^4, 2^3)$ and $V = (2^1, 2^2, 2^4, 2^5)$, then $\mathcal{U} =
\{8, 6, 2\}$, $\mathcal{V} = \{10, 8, 4\}$ and $\mathcal{D} = \{10, 6, 4, 2\}$, Figure 2.
To form the shape V from U communication is required
in the set of dimensions defined by $\mathcal{U} - \mathcal{V}$ for axes being
combined into one, and the set of dimensions defined by
$\mathcal{V} - \mathcal{U}$ for axes being split. $\mathcal{D}$ is the set of dimensions
for which communication is required for changing the
shape U into V. $\mathcal{D}_{paddr}$ is the subset of dimensions in $\mathcal{D}$
assigned to processor dimensions in the physical address
space. $\mathcal{D}_{maddr} = \mathcal{D} - \mathcal{D}_{paddr}$ is the set of dimensions in
$\mathcal{D}$ assigned to local memory dimensions in the physical
address space.
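
The sets of Definition 3 can be computed mechanically from the axis exponents. The sketch below is illustrative: the function names are not from the paper, and the example exponents are an assumption chosen to reproduce the sets {8, 6, 2} and {10, 8, 4} quoted in the text (the exponent of the most significant axis does not influence either set).

```python
def boundary_dims(exponents):
    """Most significant dimension of every axis except the most significant.

    `exponents` lists the axis exponents most significant axis first,
    i.e. (u_{d-1}, ..., u_0); the result is {alpha_k : 0 <= k < d-1}.
    """
    low_first = list(reversed(exponents))
    dims, total = set(), 0
    for e in low_first[:-1]:
        total += e
        dims.add(total - 1)
    return dims


def comm_dims(u, v):
    """Return (D, dims for axes being combined, dims for axes being split)."""
    cu, cv = boundary_dims(u), boundary_dims(v)
    return cu ^ cv, cu - cv, cv - cu


if __name__ == "__main__":
    u = (3, 2, 4, 3)   # assumed exponents of shape U
    v = (1, 2, 4, 5)   # assumed exponents of shape V
    print(boundary_dims(u), boundary_dims(v), comm_dims(u, v))
```

The symmetric difference `cu ^ cv` is exactly $(\mathcal{U} \cup \mathcal{V}) - (\mathcal{U} \cap \mathcal{V})$.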


3  Reshaping  Arrays 

Lemma 2 below states the fact that splitting an axis
into two, or merging two axes into one, requires a code
change in precisely one dimension.
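
This claim can be checked numerically for small cubes. The following Python sketch is a sequential model, not the paper's node program: it starts with element i at address $G_n(i)$, moves every element whose index has bit m set across cube dimension $m - 1$, and compares the result with the target address $G_{n-m}(k) \| G_m(l)$ for $i = k 2^m + l$. The function names are illustrative.

```python
def gray(i):
    """Binary-reflected Gray code of i."""
    return i ^ (i >> 1)


def split_axis(n, m):
    """Exchange in dimension m-1 wherever bit m of the element index is one.

    Returns a dict mapping node address -> element index after the exchange.
    """
    held = {gray(i): i for i in range(2 ** n)}
    after = {}
    for addr, i in held.items():
        if (i >> m) & 1:
            after[addr ^ (1 << (m - 1))] = i   # cross dimension m-1
        else:
            after[addr] = i                    # stay in place
    return after


def target_address(i, m):
    """Address G_{n-m}(k) || G_m(l) for i = k * 2**m + l."""
    k, l = i >> m, i & ((1 << m) - 1)
    return (gray(k) << m) | gray(l)


if __name__ == "__main__":
    n, m = 4, 2
    after = split_axis(n, m)
    print(all(after[target_address(i, m)] == i for i in range(2 ** n)))
```

In this model the old and new addresses of an element differ in at most one bit, bit $m - 1$, and they differ exactly when bit m of the index is one.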


Lemma 2 Assume node $G_n(i)$ contains element $A[i]$,
$i \in Z_N$, initially. If all nodes $G_n(z)$, $z = (b_{n-1} b_{n-2} \cdots b_0)$,
such that $b_m = 1$ exchange data in dimension
$m - 1$ for any $m \in \{1, 2, \ldots, n - 1\}$, then
node $G_{n-m}((b_{n-1} b_{n-2} \cdots b_m)) \| G_m((b_{m-1} b_{m-2} \cdots b_0))$
contains element $A[i]$ after the exchange.

Proof: Assume that the reshape operation is $U =
(2^n) \rightarrow V = (2^{n-m}, 2^m)$, and that address $G_n(i) =
(g_{n-1}(i) g_{n-2}(i) \cdots g_0(i))$ initially contains element $A[i]$.
Let $i = k 2^m + l$, $k \in Z_{2^{n-m}}$, $l \in Z_{2^m}$. After
the reshape operation element i, now $(k, l)$,
should reside in address $(G_{n-m}(k) \| G_m(l))$, where
$G_{n-m}(k) = G_{n-m}((b_{n-m-1}(k) b_{n-m-2}(k) \cdots b_0(k))) =
(g_{n-m-1}(k) g_{n-m-2}(k) \cdots g_0(k))$
and $G_m(l) = G_m((b_{m-1}(l) b_{m-2}(l) \cdots b_0(l))) =
(g_{m-1}(l) g_{m-2}(l) \cdots g_0(l))$.

From the binary encoding $b_j(k) = b_{m+j}(i)$, $j \in Z_{n-m}$,
and $b_j(l) = b_j(i)$, $j \in Z_m$. By Lemma 1, $g_j(k) =
b_j(k) \oplus b_{j+1}(k) = b_{m+j}(i) \oplus b_{m+j+1}(i) = g_{m+j}(i)$, for all
$j \in Z_{n-m}$, and $g_j(l) = b_j(l) \oplus b_{j+1}(l) = b_j(i) \oplus b_{j+1}(i) =
g_j(i)$ for all $j \in Z_{m-1}$. But, $g_{m-1}(l) = b_{m-1}(l) \oplus
b_m(l) = b_{m-1}(i)$ and $g_{m-1}(i) = b_{m-1}(i) \oplus b_m(i)$, so

$$g_{m-1}(l) = \begin{cases} g_{m-1}(i), & \text{if } b_m(i) = 0, \\ \bar{g}_{m-1}(i), & \text{if } b_m(i) = 1. \end{cases}$$

Hence, if $b_m(i) = 0$ then $G_n(i) = (G_{n-m}(k) \| G_m(l))$
and no data motion is necessary for reshaping. But,
if $b_m(i) = 1$ then an exchange is required in dimension
$m - 1$, and only in dimension $m - 1$, since this dimension
is the only dimension in which the code for i and $(k,l)$
differs. ∎

The change in the binary-reflected Gray code caused
by an axis splitting, or the merging of two axes, is limited
to the most significant dimension of the lower ordered
axis in the created pair of axes. The pairs of addresses
exchanging content in a given dimension depend
upon the order of exchanges in the case of multiple axes
splittings. The control of the exchange is derived from
$b_m$ in the encoding of i. The index i assigned to an address
changes if a more significant controlling dimension
is one. For example, consider the reshaping of an array
of 8 elements encoded in a binary-reflected Gray code
to an array of 2 x 2 x 2 elements (which is equivalent to
conversion to binary code). Figure 3 shows exchanged
data in boldface, and two exchange orders: dimension
one then zero, or zero then one. As is apparent from
Figure 3, an exchange is carried out in dimension zero
between addresses 110 and 111 if the dimensions are
treated in the order one first then zero, but not if the
order is dimension zero first, then dimension one.

The current value of $b_m$ that is assigned to a given
address $(g_{n-1} g_{n-2} \cdots g_0)$ is easily determined from the
address.

Lemma 3 If the number of exchanges in dimensions
more significant than $m$ is even, then the current value
of logic dimension $m$ assigned to an address $G_n(i) =
(g_{n-1} g_{n-2} \cdots g_0)$ is $b_m$, otherwise it is $\overline{b}_m$.

The  lemma  follows  directly  from  Corollary  2. 
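
The exchange rule of Lemma 2 and the order independence illustrated by Figure 3 can be checked with a small simulation. The sketch below is only an illustration: the names `gray`, `gray_layout`, `split_exchange`, and `reshape` are hypothetical, and the control bit $b_m$ is read from the index of the element a node currently holds instead of being derived from the node address.

```python
from itertools import permutations


def gray(i):
    """Binary-reflected Gray code of i."""
    return i ^ (i >> 1)


def gray_layout(n):
    """mem[address] = element index, with element i stored at address gray(i)."""
    mem = [0] * (1 << n)
    for i in range(1 << n):
        mem[gray(i)] = i
    return mem


def split_exchange(mem, m):
    """Split an axis at bit m: elements with b_m = 1 cross dimension m - 1."""
    out = list(mem)
    for addr, elem in enumerate(mem):
        if (elem >> m) & 1:
            out[addr ^ (1 << (m - 1))] = elem
    return out


def reshape(n, order):
    """Apply the splits listed in `order` to a Gray-coded array of 2**n elements."""
    mem = gray_layout(n)
    for m in order:
        mem = split_exchange(mem, m)
    return mem


if __name__ == "__main__":
    # Both exchange orders of the 8-element example end in plain binary order.
    for order in permutations([1, 2]):
        print(order, reshape(3, order))
```

Applying every split $m \in \{1, \ldots, n-1\}$, in any order, leaves element $i$ at address $i$, which is the Gray-to-binary conversion discussed in Section 4.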

Half of the total number of elements need to be exchanged
for any split/merge operation. Hence, the number
of exchanges in which an element participates falls in
the range $0$ to $|\mathcal{P}|$, depending upon its binary encoding.

Figure 3: Reshaping an array of 8 elements into a 2 x 2 x 2
array.


The total number of element exchanges is $|\mathcal{P}|$ times
half the number of array elements for
changing shape $U$ to shape $V$. We will now determine
the number of element exchanges in sequence when the
logic array is allocated to an $n$-cube, with $K$
elements per processor.

Theorem 1 A lower bound for the number of element
transfers in sequence for array reshaping affecting the
encoding of processor dimensions is $K/2$ with $K$ elements
per processor.

Proof: Pick a dimension $d \in \mathcal{P}_{paddr}$. There are $N/2$
processors that need to transfer data across dimension $d$.
There are $K$ elements in each processor, and all elements
need to be exchanged. The available bandwidth per
dimension is $N$. ∎

In the following, let $\delta = |\mathcal{P}_{paddr}|$.

Theorem 2 Changing the shape $U$ to shape $V$ preserving
the assignment of logic dimensions to physical dimensions
requires at most $\lceil K/\delta \rceil \delta$ element transfers in
sequence with concurrent communication.

Proof: Let $\mathcal{P}_{paddr} = \{d_{\delta-1}, d_{\delta-2}, \ldots, d_0\}$. Partition
the local data set of size $K$ into $\delta$ sets of size at most
$\lceil K/\delta \rceil$ each. Label the data sets from 0 to $\delta - 1$. Each
such set is assigned a sequence of dimensions including
all dimensions in $\mathcal{P}_{paddr}$ once. Different sets are assigned
different sequences such that no two sets have the same
first, second, third, etc., dimension. For instance, let
data in set $m$ be assigned the sequence of dimensions
$d_m, d_{(m+1) \bmod \delta}, \ldots, d_{(m-1) \bmod \delta}$. ∎
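
The rotation used in the proof of Theorem 2 can be written out directly. The following is a minimal sketch under the assumption that the bound is $\delta \lceil K/\delta \rceil$; the function names `exchange_schedule` and `transfers_in_sequence` are hypothetical and not taken from the paper.

```python
def exchange_schedule(dims):
    """Row m lists the dimensions used by data set m at steps 0, 1, ..., delta - 1.

    Set m starts at dims[m] and walks cyclically through the list, so no two
    sets use the same dimension during the same step.
    """
    delta = len(dims)
    return [[dims[(m + t) % delta] for t in range(delta)] for m in range(delta)]


def transfers_in_sequence(K, delta):
    """delta steps, each moving one data set of at most ceil(K / delta) elements."""
    return delta * -(-K // delta)


if __name__ == "__main__":
    for m, row in enumerate(exchange_schedule([0, 2, 5])):
        print("set", m, "uses dimensions", row)
    print(transfers_in_sequence(12, 3), transfers_in_sequence(10, 3))
```

Every column of the schedule is a permutation of the communication dimensions, which is what allows all ports to be used concurrently.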

The  upper  bound  in  Theorem  2  differs  from  the  lower 
bound  by  a  factor  of  two.  The  upper  bound  can  be 
improved  in  some  cases.  We  give  upper  bounds  that  are 
almost  identical  to  the  lower  bounds  for  two  cases. 


Theorem 3 Changing the shape $U$ to shape $V$ preserving
the assignment of logic dimensions to physical dimensions
requires at most $\delta \lceil K/(2\delta) \rceil + 1$ element transfers
in sequence with concurrent communication, if no two
elements of $\mathcal{P}_{paddr}$ differ by one and $K > 2\delta$.

Proof: Consider the merging of a single pair of axes, or
splitting of an axis. Assume the communication occurs
in dimension $m - 1$. Consider a 2-cube formed by dimensions
$m$ and $m - 1$. Label the four nodes according to
$b_m b_{m-1}$. By Lemma 2, communication is only required
between nodes 10 and 11. There exist two edge-disjoint
paths between these two nodes of lengths one and three,
respectively. By assigning $\lceil K/2 \rceil + 1$ elements to the path
of length one and the remaining elements to the path
of length three, $\lceil K/2 \rceil + 1$ element transfers in sequence
are required.

If no two elements in $\mathcal{P}_{paddr}$ differ by one, then the
2-cubes used for different data sets are disjoint. Thus,
$\delta (\lceil K/(2\delta) \rceil + 1)$ element transfers in sequence are required.
To reduce the communication complexity to $\delta \lceil K/(2\delta) \rceil + 1$,
we slightly overlap the communications on the successive
2-cubes of a given data set. Without this overlap
no data is sent along the length-three path during the
last two cycles of the routing of a data set. By sending
two elements that have been routed with respect to
the first 2-cube to the length-three path of the second
2-cube during the last two cycles of the routing phase of
the first 2-cube (with one cycle each), the communication
delay due to the length-three path is only paid once.
Sending elements along the length-three path during the
last two cycles of the first 2-cube will not interfere with
the communication of the data set exchanged in the second
2-cube. The reduced complexity is valid if $\lceil K/\delta \rceil > 2$,
i.e., some data set has at least three elements. ∎

In the routing used for the proof of the bound, the
number of elements routed along the length-one path
and the length-three path differ by two only for the first
2-cube. For subsequent 2-cubes, the same number of
elements are routed along each path, with the length-three
path starting two cycles earlier. The first element
on both paths arrives at the same time within the 2-cube
except for the first 2-cube. If $2\delta$ divides $K$ and $K > 2\delta$,
then the complexity is $K/2 + 1$, which is only one element
transfer above the lower bound. For $K < 2\delta$, there is
no advantage of using the length-three paths over the
algorithm used in the proof of Theorem 2.

If the reshape operation requires communication in
dimensions $m - 1$ and $m$ (by creating an axis of length
2 encoded in dimension $m$), then dimension $m$ cannot
be used for rerouting to access unused communication
links in dimension $m - 1$. Unused links in dimensions
lower than $m - 1$ cannot be used either, since they do
not connect to processors with unused links in dimension
$m - 1$. However, the following observation can be used
to reduce the number of element transfers in sequence.

Lemma 4 For a reshape operation requiring communication
in dimension $m - 1$ none of the links in dimension
$m - 1$ is used in $m - 1$ dimensional subcubes obtained
through complementing any of the address dimensions
that are more significant than $m - 1$.

Proof: We need to show that in any $m - 1$ dimensional
subcube defined by dimensions $m$ and higher, $b_m = 0$ if
the address defining the subcube is obtained by complementing
a single dimension of significance $m$ or higher.
But, by Lemma 1 complementing a single dimension $g_j$,
$j \in \{m, m + 1, \ldots, n - 1\}$ complements $b_m$. ∎

By  using  a  pipelined  algorithm  instead  of  the  non- 
pipelined  maximally  concurrent  algorithm  used  for  the 
upper  bound  in  Theorem  3,  the  properties  in  Lemma  4 
can  be  exploited  to  establish  the  following  bound. 

Theorem 4 Changing the shape $U$ to shape $V$ requires
at most $\lceil K/2 \rceil + 2\delta - 1$ element transfers in sequence, if
for each dimension requiring communication there exists
one more significant dimension not requiring communication
and $K > 2\delta$.

Proof: The problem is equivalent to sending $K$ elements
along a path of length $\delta$ and each edge on the
path is paired with a length-three path, disjoint with all
other edges. If $\delta$ is even two edge-disjoint paths of length
$2\delta$ can be defined by combining length-three and length-one
paths for different dimensions. If $\delta$ is odd, then two
paths of length $2\delta - 1$ and $2\delta + 1$ can be defined in a
similar way. ∎

Several routing schemes yield the same complexity as
the scheme used in the proof. For instance, by creating
one path of length $\delta$ and one of length $3\delta$, and routing
$\lceil K/2 \rceil + \delta$ elements along the short route and $\lfloor K/2 \rfloor - \delta$
elements along the long route the same routing time is
achieved if $K > 2\delta$. For $K < 2\delta$, the latter approach
degenerates to using a single path of length $\delta$ and the
required time is $K + \delta - 1$, which is lower than if two
paths of the same length were used. However, if $K < 2\delta$
then the time for reshaping by pipelining along one path
is higher than, or at best the same as if the concurrent
exchange algorithm in the proof of Theorem 2 is used.
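
The bounds discussed in this section are easy to compare numerically. The formulas in the sketch below are a reading of the garbled originals and should be treated as assumptions: $K/2$ for the lower bound, $\delta \lceil K/\delta \rceil$ for Theorem 2, $\delta \lceil K/(2\delta) \rceil + 1$ for Theorem 3, $\lceil K/2 \rceil + 2\delta - 1$ for Theorem 4, and $K + \delta - 1$ for pipelining along a single path. All function names are hypothetical.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def lower_bound(K):
    return K / 2


def concurrent_exchange(K, delta):       # assumed form of the Theorem 2 bound
    return delta * ceil_div(K, delta)


def disjoint_two_cubes(K, delta):        # assumed form of the Theorem 3 bound
    return delta * ceil_div(K, 2 * delta) + 1


def pipelined_two_paths(K, delta):       # assumed form of the Theorem 4 bound
    return ceil_div(K, 2) + 2 * delta - 1


def pipelined_single_path(K, delta):     # K elements pipelined over delta hops
    return K + delta - 1


if __name__ == "__main__":
    K, delta = 24, 3
    for f in (concurrent_exchange, disjoint_two_cubes,
              pipelined_two_paths, pipelined_single_path):
        print(f.__name__, f(K, delta), "lower bound", lower_bound(K))
```

Under these assumed formulas, the single-path time is never below the concurrent exchange time when $K < 2\delta$, which agrees with the remark closing the paragraph above.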

Lemma 4 cannot be exploited directly for concurrent
exchange sequences because an exchange in one dimension
affects the set of edges being used in a subcube.
This property follows from Lemma 3. For instance, if
a 1 x 16 array is reshaped into a 4 x 2 x 2 array, then
if an exchange in dimension one is performed first the
required exchanges in dimension zero are all on corresponding
links in different subcubes instead of complementary
links.


4  Conversion  between  Gray 
code  and  binary  code 

Theorem 5 The conversion between a binary-reflected
Gray code and binary code in either direction requires
communication in $n - 1$ dimensions, and at most
$(n - 1) \lceil K/(n - 1) \rceil$ element transfers in sequence.

Theorem 5 follows from Theorem 2 and the observation
that conversion from binary-reflected Gray code
to binary code in an $n$-cube is equivalent to reshaping
a one-dimensional array of size $2^n$ to an $n$-dimensional
array of shape $2 \times 2 \times \cdots \times 2$.

In  any  algorithm  according  to  Lemma  2  and  Theo¬ 
rem  5  only  half  of  the  communications  links  in  each 
of  the  n  —  1  dimensions  are  used  in  every  step  of  the 
algorithm.  Every  path  is  of  minimum  length,  and  all 
minimum  length  paths  are  used  evenly.  The  load  on  the 
communications  network  is  minimal. 

Conjecture 1 For the conversion between binary-reflected
Gray code and binary code encodings of $K$ elements
per processor in an $n$-cube, a lower bound is
$\frac{n-1}{n} K$.

For $n = 2$, the conjecture follows from Theorem 3.
For $n > 2$ only the most significant dimension requires
no communication.

Corollary 3 The conversion between binary-reflected
Gray code and binary code encoding in an $n$-cube can be
performed as an arbitrary sequence of communications
in dimensions $\{0, 1, \ldots, n - 2\}$.

The corollary follows from the observation that the
control is completely determined by the binary encoding
of $i$.

An algorithm proceeding from dimension $n - 2$ to dimension
0 is depicted in Figure 4. Initially, processor
$G_4(i)$ contains data of index $i$. After the conversion, $i$ is
assigned to processor $B_4(i)$. The algorithm is described
below. Several other algorithms are given in [7].

/* Converting Gray code to binary code
   starting from the most significant dimension */

for d := n - 2 downto 0 do
    if g_{d+1} = 1 then
        exch. content with the neighbor in dim. d
    endif
enddo
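
The loop above can be simulated on a list that stands in for the $2^n$ node memories. This is a minimal sketch with one element per node; the function name `gray_to_binary` is hypothetical, and the control bit is the current address bit $g_{d+1}$ of each node, as in the pseudocode.

```python
def gray_to_binary(n):
    """Simulate the conversion on an n-cube holding one element per node.

    mem[address] is the index of the element stored at that address.
    Initially element i sits at address gray(i) = i ^ (i >> 1).
    """
    size = 1 << n
    mem = [0] * size
    for i in range(size):
        mem[i ^ (i >> 1)] = i

    for d in range(n - 2, -1, -1):
        new = list(mem)
        for addr in range(size):
            # Nodes whose address bit d + 1 is one exchange across dimension d.
            if (addr >> (d + 1)) & 1:
                new[addr ^ (1 << d)] = mem[addr]
        mem = new
    return mem


if __name__ == "__main__":
    print(gray_to_binary(3))
```

Both partners of an exchange share address bit $d + 1$, so every move is a true pairwise exchange and the result is a permutation.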


The control in the above algorithm is particularly simple,
since the following corollary follows from Lemma 3.

Figure 4: Conversion of a binary-reflected Gray code to
binary code

Corollary 4 If the conversion from binary-reflected
Gray code to binary code proceeds from the most significant
dimension to the least significant dimension, then
the current value of $b_m$ assigned to an address is equal
to $g_m$, where $m$ is the controlling dimension.

The algorithm is easy to generalize to an arbitrary
starting dimension $m$, $m \in Z_{n-1}$, with exchanges in successive
dimensions of decreasing order in a cyclic fashion.
The first exchange requires the computation of $b_{m+1}$.
Figure 5 gives an example. Sequence 2 is the same as
in Figure 4. The figure shows the location of $i$ for each
step of the algorithm for each sequence. For concurrent
exchanges the local data set $K$ is divided into $n - 1$
sets, and set $m$, $m \in Z_{n-1}$, is subject to exchange in
dimension $(n - 2 - m - t) \bmod (n - 1)$ during step $t$,
$t \in Z_{n-1}$.

/* Converting Gray code to binary code starting from
   dimension m. Dimensions in decreasing order, cyclically */
if g_{n-1} ⊕ g_{n-2} ⊕ ... ⊕ g_{m+1} = 1 then
    exch. content with the neighbor in dim. m
endif

for d := m - 1 downto 0 do
    if g_{d+1} = 1 then
        exch. content with the neighbor in dim. d
    endif
enddo

for d := n - 2 downto m + 1 do
    if g_{d+1} = 1 then
        exch. content with the neighbor in dim. d
    endif
enddo
Figure 5: Concurrent conversion of a binary-reflected
Gray code to binary code.
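
The cyclic variant can be simulated in the same way as the basic algorithm. The sketch below assumes one element per node and uses the hypothetical name `gray_to_binary_from`; the first exchange is controlled by the parity of the address bits above dimension $m$, and the later exchanges by the current address bit $g_{d+1}$.

```python
def gray_to_binary_from(n, m):
    """Gray-to-binary conversion on an n-cube, starting in dimension m.

    Dimensions are then visited in decreasing order, cyclically:
    m, m - 1, ..., 0, n - 2, ..., m + 1.
    """
    size = 1 << n
    mem = [0] * size
    for i in range(size):
        mem[i ^ (i >> 1)] = i          # element i starts at address gray(i)

    def exchange(mem, dim, control):
        new = list(mem)
        for addr in range(size):
            if control(addr):
                new[addr ^ (1 << dim)] = mem[addr]
        return new

    # First exchange: g_{n-1} xor ... xor g_{m+1} = 1, i.e. odd parity above bit m.
    mem = exchange(mem, m, lambda a: bin(a >> (m + 1)).count("1") & 1)
    for d in list(range(m - 1, -1, -1)) + list(range(n - 2, m, -1)):
        mem = exchange(mem, d, lambda a, d=d: (a >> (d + 1)) & 1)
    return mem


if __name__ == "__main__":
    for m in range(3):
        print(m, gray_to_binary_from(4, m))
```

For concurrent operation, each of the $n - 1$ local data sets would run this sequence with a different starting dimension, so that no two sets use the same dimension in the same step.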

5  Summary 

We  have  shown  that  the  splitting  of  a  binary-reflected 
Gray  code  encoded  axis  into  two  binary-reflected  Gray 
coded  axes  only  requires  an  exchange  in  the  most  signif¬ 
icant  dimension  of  the  lower  order  axis.  The  exchanges 
required  for  multiple  axis  splittings  can  be  performed  in 
arbitrary  order. 

Assume concurrent communication on all ports, $K$ elements
per processor, and $\delta$ dimensions requiring communication
for the reshape operation. If $K$ is a multiple
of $\delta$, then the number of element transfers in sequence
is independent of $\delta$. An upper bound is $K$ and a lower
bound is $K/2$. We present three algorithms: (i) one of
communication complexity $\lceil K/\delta \rceil \delta$, (ii) one of complexity
$\delta \lceil K/(2\delta) \rceil + 1$ for reshape operations for which no two
dimensions requiring communication are adjacent and
$K > 2\delta$, and (iii) one of complexity $\lceil K/2 \rceil + 2\delta - 1$, if
there is one unused processor dimension of higher order
for every processor dimension requiring communication.
The previously best known algorithm has a complexity
of $K + \delta - 1$ [6].

The conversion between binary-reflected Gray code
and binary code encodings is a special case of reshaping
an array, and can be carried out on an $n$-cube by $n - 1$
exchanges in dimensions $0, 1, \ldots, n - 2$ in arbitrary
order with a complexity of at most $(n - 1) \lceil K/(n - 1) \rceil$ element
transfers in sequence.
Abstract

By  randomly  seeding  individual  processors  in  a 
parallel  environment  with  unique  random  number  gen¬ 
erators  it  is  possible  to  take  full  advantage  of  the 
economies  of  scale  present  in  the  parallel  environment 
to achieve more accurate simulations. While a single
random number generator is sufficient for a serial
computer, the same is not true for a parallel computer.
Multiple  copies  of  the  same  generator  do  not  improve 
the  quality  of  the  simulation  as  the  period  may  be 
insufficient  to  prevent  exhaustion  or  ’banding’  of  the 
variates.  Our  approach  is  to  provide  each  processor 
with  its  own  unique  random  number  generator  and 
use  a  common  seed  value.  This  ensures  each  simula¬ 
tion  is  unique  as  each  generator  is  different  due  to  ran¬ 
dom  assignment  by  the  front-end  computer.  The  lin¬ 
ear  congruential  method  was  chosen  due  to  widespread 
familiarity  and  acceptance  of  the  technique.  By  us¬ 
ing  a  sequence  of  random  numbers  generated  on  the 
front-end  computer,  prime  numbers  are  selected  from 
a  predefined  array  of  2048  primes  and  assigned  to  pro¬ 
cessors.  To  provide  maximum  possible  period  to  the 
generators,  all  2048  primes  in  the  array  are  six  digits 
in  size.  This  gives  the  researcher  the  ability  to  run 
simulations  involving  up  to  a  million  random  numbers 
with  a  high  degree  of  certainty  that  each  processor  is 
running  a  different  simulation.  By  taking  advantage  of 
the  large  periods,  the  economies  of  scale  available  on 
a parallel machine can then be exploited to run large
scale  simulations  involving  millions  of  numbers  which 
would  be  prohibitive  on  a  serial  machine. 

Random  Numbers  and  Simulation 

Simulation  has  achieved  a  new  level  of  impor¬ 
tance  in  the  modern  era.  Many  things  of  interest  to 
researchers  cannot  be  directly  viewed,  experimented 
upon,  or  actually  done  due  to  prohibitive  cost  or 
danger.  In  these  cases,  simulation  becomes  the  re¬ 
searcher’s  main  tool. 

Simulations  of  real  world  systems  are  usually 
mathematical  models.  Nuclear  plants,  national  eco¬ 
nomies,  and  plane  crashes  are  only  a  few  of  the  sys¬ 
tems  which  can  be  modeled  by  the  use  of  mathematics. 
However,  the  formulas  themselves  are  useless  without 
some  sort  of  input  to  drive  them.  Random  numbers 
drive  the  equations  and  produce  the  results. 


Random  numbers  are  usually  the  input  for  these 
models  since  the  models  are  based,  for  the  most  part, 
on  probabilities.  As  an  example,  a  splitting  atom  gives 
off a neutron, which has a probability $k$ of hitting another
atom and forcing it to split. There is also, however,
the probability $1 - k$ that the neutron will not
hit another atom and the reaction may die.

This  method  of  simulation  is  known  as  stochas¬ 
tic  simulation  but  usually  referred  to  as  Monte  Carlo 
simulation. Monte Carlo involves sampling from a specific
distribution, usually the Uniform (0,1], as these
numbers approximate probabilities. The samples are
then  used  to  estimate  the  value  of  an  integral,  even 
in  cases  where  the  integral  may  not  be  readily  appar¬ 
ent.  Random  numbers  must  be  generated  somehow, 
as  computers,  where  most  of  the  simulations  are  now 
done,  do  not  have  built-in  tables  of  random  numbers. 
It  is  for  this  purpose  that  the  congruential  generator 
methods  were  developed. 

The  Linear  Congruential  Method 

For  the  purpose  of  this  paper,  the  Linear  Con¬ 
gruential  Generator  method  was  employed.  Moti¬ 
vating  our  choice  of  the  LCG  was  ease  of  program¬ 
ming,  debugging,  and  understanding.  The  congruen¬ 
tial  generation  method  is  probably  the  most  commonly 
used  method  for  generating  random  numbers  today. 
For  a  complete  explanation  of  the  method  see  either 
Kennedy  and  Gentle  [2]  or  Rubinstein  [4].  For  those 
who  may  not  be  acquainted  with  these  methods  we 
shall  provide  a  brief  overview. 

Congruential  methods  generate  pseudo-random 
numbers  by  using  a  recursive  formula.  The  numbers 
generated  are  called  pseudo-random  since  they  are,  per 
force,  deterministic.  Given  the  same  formula,  the  same 
seed  value  will  always  generate  the  same  sequence  of 
“random”  numbers.  This  is  not  truly  unfortunate 
since it makes replicating results possible, which a truly
random  sequence  of  numbers  would  make  impossible. 

The general form of a congruential generator is

$$x_i = (a x_{i-1} + c) \pmod{m} \quad \text{for } i = 1, 2, \ldots$$

where

x_i - the next random in the sequence
a - the multiplier
c - the increment
m - the modulus

and $0 \le x_i < m$. The value for $x$ at $i = 0$ is referred
to as the seed and is provided by the user. $a$, $c$, and
$m$ must all be nonnegative.

The generator can have a maximal period, the
interval between repeating values, of $m$ if and only if
the following conditions are met [2]:

1. $c$ is relatively prime to $m$

2. $a \equiv 1 \pmod{p}$ for every prime factor $p$ of $m$

3. $a \equiv 1 \pmod{4}$ if 4 is a factor of $m$
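
The recurrence and the three period conditions can be sketched in a few lines. This is an illustration only; the names `lcg`, `prime_factors`, `has_full_period`, and `period` are hypothetical, and the brute-force `period` function is practical only for small moduli.

```python
from math import gcd


def lcg(a, c, m, seed):
    """Yield x_1, x_2, ... of the recurrence x_i = (a * x_{i-1} + c) mod m."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x


def prime_factors(m):
    """Distinct prime factors of m by trial division."""
    factors, p = set(), 2
    while p * p <= m:
        while m % p == 0:
            factors.add(p)
            m //= p
        p += 1
    if m > 1:
        factors.add(m)
    return factors


def has_full_period(a, c, m):
    """Check the three conditions listed above."""
    if gcd(c, m) != 1:
        return False
    if any((a - 1) % p for p in prime_factors(m)):
        return False
    if m % 4 == 0 and (a - 1) % 4:
        return False
    return True


def period(a, c, m, seed=0):
    """Length of the cycle reached from `seed`, found by brute force."""
    seen, x, step = {}, seed % m, 0
    while x not in seen:
        seen[x] = step
        x = (a * x + c) % m
        step += 1
    return step - seen[x]


if __name__ == "__main__":
    print(has_full_period(5, 3, 16), period(5, 3, 16))
    print(has_full_period(3, 3, 16), period(3, 3, 16))
```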

In reality, finding a quadruple $(a, c, p, m)$ would
take too much time to be justifiable. A reasonable
approximation for $c$ can be made by

$$c = \left(\tfrac{1}{2} - \tfrac{1}{6}\sqrt{3}\right) m$$

which was proposed by Knuth [3].

The  Current  Approach 

Fox et al. [1] suggest using the linear congruential
generator method, but adapting it to the parallel
architecture of a hypercube. The approach they
propose is to load each node with the same generator
but have nodes “leapfrog” each other. This staggering
effect is obtained by having each node step into the
sequence of randoms $n$ variates, where $n$ is the processor’s
number, starting at zero. Formally, if there are
$p$ nodes, this means that node 0 gets $x_0$ as a random
value, node 1 gets $x_1$, node 2 gets $x_2$, ..., node $p - 1$
gets $x_{p-1}$, node 0 gets $x_p$, node 1 gets $x_{p+1}$, and so
on.
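
The leapfrog assignment can be sketched as follows. This is not code from Fox et al.; it is a minimal illustration in which each node advances by $p$ steps at a time using the composed recurrence $x_{i+p} = (A x_i + C) \bmod m$. The names `leapfrog_params` and `node_stream` are hypothetical.

```python
def leapfrog_params(a, c, m, p):
    """Return (A, C) such that applying x -> (A*x + C) mod m equals p LCG steps."""
    A, C = 1, 0
    for _ in range(p):
        A, C = (A * a) % m, (C * a + c) % m
    return A, C


def node_stream(a, c, m, seed, p, node, count):
    """First `count` values seen by `node`: x_node, x_{node+p}, x_{node+2p}, ..."""
    x = seed                      # x_0 is the seed
    for _ in range(node):         # step into the sequence by the node number
        x = (a * x + c) % m
    A, C = leapfrog_params(a, c, m, p)
    out = []
    for _ in range(count):
        out.append(x)
        x = (A * x + C) % m
    return out


if __name__ == "__main__":
    for node in range(4):
        print(node, node_stream(5, 3, 16, 7, 4, node, 4))
```

Interleaving the node streams reproduces the single serial sequence, which is why only one sequence of numbers is in use no matter how many nodes take part.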

While  this  accomplishes  the  task  of  generating 
random  numbers  for  each  node,  and  parallelizes  the 
task,  it  carries  some  inherent  problems. 

First,  this  method  uses  only  one  random  number 
generator.  While  some  randomization  of  the  nodes 
is  introduced  by  the  stepping  algorithm,  the  fact  that 
only  one  sequence  of  numbers  is  being  used  is  not  over¬ 
come.  This  can  have  debilitating  side  effects.  If  the  pe¬ 
riod  is  not  large  enough,  the  nodes  might  overlap,  pro¬ 
ducing  multiple  simulations  which  are  identical.  True, 
this  is  a  pathological  example,  but  even  if  the  period 
is  large  enough  to  prevent  overlap,  banding  will  occur 
in  the  randoms.  By  banding,  we  mean  that  the  points 
$(x_0, x_1), (x_2, x_3), \ldots$ when plotted produce bands, indicating
a high degree of correlation.

It  is  possible  for  a  simulation  to  exhaust  the  pe¬ 
riod  of  the  random  number  generator  since  the  period 
is  reduced  in  size  to  m/n  where  n  is  the  number  of 
nodes  in  a  simulation.  In  such  cases,  the  generator 
would  begin  generating  the  same  randoms  again,  due 
to  the  deterministic  nature  of  the  algorithm.  Since  the 
nodes are staggered, each node would begin to progressively
exhaust the period of its generator. Nodes
would then begin to rerun simulations which had already
been run by other nodes thereby producing identical
results.

Fox’s  method  does  provide  a  way  of  generating 
random  numbers  for  parallel  simulations,  but  it  is  a 
method  open  to  needless  redundancy  of  effort. 

A  Second  Approach 

As an alternative to the method proposed by Fox
et al. we present the following method. Where Fox
uses the same random number generator on all nodes,
effectively duplicating the same work on all nodes, we
provide each node with its own, unique, random number
generator. This has the effect of giving each node
a different simulation to run, and consequently, a different
answer for the simulation. This avoids unnecessary
duplication of work and reduces the possibility
of exhausting the period or “banding” of the random
numbers.

Implementation is fairly simple. Large primes
are generated and stored on the host computer. By
large, we mean primes of at least six digits, the rationale
being that the size of the modulus determines the
maximum period of the generator and we wish to provide
a workable period for large simulations. At load
time, the host program selects a prime at random from
the stored numbers and passes it to the node being
loaded. Selection of the prime is done using a random
number generated by the host program. The method
of generation for this random is arbitrary and could be
accomplished using a built-in random number generator
such as rand(). To ensure reproducibility of results,
the seed used to initialize the prime selection generator
is the same used to seed the individual node random
number generators. The prime passed to the nodes is
used as the modulus in the LCG, thereby making the
random number generator for node i different from the
generator for node j.
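
The host-side assignment described above can be sketched as follows. Several details are assumptions made for illustration and are not taken from the paper: the prime table is built with a sieve, primes are drawn without replacement so that every node gets a distinct modulus, the multiplier `a = 16807` is an arbitrary placeholder, and the increment follows the approximation for $c$ quoted earlier. All function names are hypothetical.

```python
import math
import random


def six_digit_primes(count):
    """First `count` primes in [100000, 999999], found with a simple sieve."""
    limit = 1_000_000
    sieve = bytearray([1]) * limit
    sieve[0:2] = b"\x00\x00"
    for p in range(2, math.isqrt(limit) + 1):
        if sieve[p]:
            sieve[p * p::p] = bytearray(len(range(p * p, limit, p)))
    primes = [q for q in range(100_000, limit) if sieve[q]]
    return primes[:count]


def assign_moduli(table, nodes, seed):
    """Host step: pick one prime per node, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(table, nodes)   # assumption: no prime is used twice


def node_generator(modulus, seed, a=16807):
    """Node step: an LCG whose modulus is the prime received from the host."""
    c = int((0.5 - math.sqrt(3) / 6) * modulus)   # increment approximation
    x = seed % modulus
    while True:
        x = (a * x + c) % modulus
        yield x / modulus             # variate in [0, 1)


if __name__ == "__main__":
    table = six_digit_primes(2048)
    moduli = assign_moduli(table, 8, seed=7741)
    for node, m in enumerate(moduli):
        g = node_generator(m, seed=7741)
        print(node, m, [round(next(g), 4) for _ in range(3)])
```

Using the same seed for the selection generator and for the node generators keeps a whole run reproducible from a single number.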

Results 

The method was tested on the NCUBE/Ten here
at the University of South Carolina. Two different simulations
were run: estimation of π and integral estimation.
In both cases, the results from the simulations
produced values which were correct to two decimal
places. Greater precision was not attainable although
several attempts were made.

Results  from  the  parallel  method  were  obtained 
using  cubes  of  size  64,  128,  256,  and  512  performing 
10000  simulations  each.  The  seed  values  were  taken 
from  a  table  of  primes. 
Estimation of π

This test is the same one detailed in Fox. Two
random variates are generated from the Uniform (0,1]
distribution. The values are squared and then summed.
If the result is less than one, a success is recorded. At
the completion of the simulation, the total number of
successes is multiplied by four and divided by the total
number of simulations performed. The result is a
rough approximation of π (3.14159).


Table 1 - Estimates of π

Seed     64       128      256      512
7741     3.1379   3.1427   3.1450   3.1462
15017    3.1398   3.1396   3.1396   3.1395
20117    3.1400   3.1400   3.1405   3.1407
31271    3.1457   3.1457   3.1456   3.1456
92857    3.1393   3.1384   3.1380   3.1377


Integral Estimation

This test consisted of two different methods of integral
estimation: Monte Carlo and Hit-or-Miss. The
target value for both tests in this simulation was 1/3
(.333333).

The Monte Carlo test was done by generating
samples of size 200 from the Uniform (0,1] distribution.
Variates from the samples were squared and summed
and then divided by the number of variates in each
sample to arrive at the estimation of the integral.

The Hit-or-Miss test used pairs of random variates
from the Uniform (0,1] distribution. For each pair,
if $x_k$ is less than $x_{k-1}^2$ then add one to the success
count. The total number of successes is then divided
by the number of pairs to arrive at the estimate for
the integral.


Table 2 - Monte Carlo Estimates

Seed     64       128      256      512
7741     0.3339   0.3336   0.3334   0.3327
15017    0.3316   0.3312   0.3310   0.3302
20117    0.3338   0.3338   0.3338   0.3331
31271    0.3323   0.3320   0.3320   0.3313
92857    0.3341   0.3341   0.3341   0.3341


Table 3 - Hit-or-Miss Estimates

Seed     64       128      256      512
7741     0.3332   0.3324   0.3320   0.3312
15017    0.3367   0.3378   0.3383   0.3379
20117    0.3336   0.3335   0.3334   0.3328
31271    0.3375   0.3382   0.3386   0.3381
92857    0.3324   0.3326   0.3327   0.3327


Execution Time

Execution time was averaged for the first test (estimation
of π). The timing results are in the following
table. Note that the times do not include startup and
that there is no node-to-node communication which
leads to very high efficiencies. All efficiency estimates
were calculated using

$$\text{efficiency} = \frac{\text{time}_1}{p \cdot \text{time}_p}$$


Table 4 - Execution Times

Dimension   Time (sec)   Efficiency
0           281.8257     1.0000
1           140.9265     0.9999
2           70.6814      0.9968
3           35.3558      0.9964
4           17.7055      0.9948
5           8.8612       0.9939
6           4.4454       0.9906
7           2.2376       0.9840
8           1.1336       0.9712
9           0.5816       0.9464
10          0.3055       0.9009

These results are from simulations of size 512,000.
Thus in the dimension $d$ case, each of the $2^d$ nodes generated
an equal share of the random numbers. Figure 1 shows the
efficiencies from the above table.
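
The three estimators used in the tests of this section can be sketched as follows. This is an illustration under stated assumptions, not the authors' node program: Python's `random.Random` stands in for a node's generator and draws from [0, 1) instead of (0, 1], and the function names are hypothetical.

```python
import random


def estimate_pi(rng, trials):
    """Hit-or-miss estimate of pi: a success is a point with x^2 + y^2 < 1."""
    hits = 0
    for _ in range(trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1.0:
            hits += 1
    return 4.0 * hits / trials


def monte_carlo_integral(rng, sample_size=200):
    """Sample-mean estimate of the integral of x^2 over the unit interval."""
    return sum(rng.random() ** 2 for _ in range(sample_size)) / sample_size


def hit_or_miss_integral(rng, pairs):
    """Fraction of pairs (x, y) that fall under the curve y = x^2."""
    hits = 0
    for _ in range(pairs):
        x, y = rng.random(), rng.random()
        if y < x * x:
            hits += 1
    return hits / pairs


if __name__ == "__main__":
    rng = random.Random(7741)
    print("pi          ", estimate_pi(rng, 100_000))
    print("monte carlo ", monte_carlo_integral(rng, 200))
    print("hit-or-miss ", hit_or_miss_integral(rng, 100_000))
```

On a parallel machine each node would run these loops with its own generator, and the host would average the per-node results.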
Figure 1 - Estimation of π


Introduction of Shuffling

While the LCG method is sufficient, it is not always
optimal. In many cases, a marginal, or even bad,
random number generator can be improved with the
addition of a shuffling algorithm such as that developed
by Bays and Durham (see [2]). In this case, the
LCG method generated variates that were sufficiently
random to produce acceptable, if not outstanding, results.
After the inclusion of the Bays-Durham shuffling
algorithm to further randomize the variates, the
results improved. Almost uniformly, the precision of
the results increased.


Table 5 - Estimates of π with Shuffling

Seed     64       128      256      512
7741     3.1406   3.1439   3.1454   3.1463
15017    3.1380   3.1387   3.1390   3.1392
20117    3.1412   3.1408   3.1412   3.1414
31271    3.1434   3.1433   3.1433   3.1432
92857    3.1387   3.1381   3.1379   3.1377


Conclusions

What we have shown with this simple example is that, while no
dramatic improvement is evident in accuracy, the parallel method is
as good as the serial method. Since there is no loss of precision due
to the parallelization of the simulation, the researcher could take
advantage of the economies of scale inherent in the parallel
processor to run truly large simulations which would be prohibitive
on a serial machine. This technique also provides a method by which
the accuracy of simulations can be tested by running multiple copies
in the time it would take to run a single copy.

It should be possible to overcome the precision threshold and thereby
achieve more accurate results from the parallel simulations. Even
with the threshold, however, the results indicate that reliably
independent simulations can be run on parallel machines using
independent random number generators.
Abstract

Parallel computers are ideally suited to the Monte Carlo simulation
of spin models using the standard Metropolis algorithm, since it is
regular and local. However, local algorithms have the major drawback
that near a phase transition the number of sweeps needed to generate
a statistically independent configuration increases as the square of
the lattice size. New algorithms have recently been developed which
dramatically reduce this ‘critical slowing down’ by updating clusters
of spins at a time. The highly irregular and non-local nature of
these algorithms means that they are much more difficult to
parallelize efficiently. Here we introduce the new cluster
algorithms, explain some sequential algorithms for identifying and
labeling connected clusters of spins, and then outline some parallel
algorithms which have been implemented on MIMD machines.


1.  Introduction 

Computer simulations are extremely useful in the study of spin models
in condensed matter physics. In these models the spins are usually
set up on the sites of a d-dimensional hypercubic lattice of length
L. The spins form some configuration $\phi$. The goal of computer
simulations is to generate configurations of spins typical of
statistical equilibrium and measure physical observables on this
ensemble of configurations. This is traditionally performed by Monte
Carlo methods such as the Metropolis algorithm [1], which produce
configurations with a probability given by the Boltzmann distribution
$e^{-\beta S(\phi)}$, where $S(\phi)$ is the action, or energy, of
the system in configuration $\phi$ and $\beta$ is the inverse
temperature. One of the main problems with these methods in practice
is that successive configurations are not statistically independent,
but rather are correlated, with some autocorrelation time $\tau$
between effectively independent configurations.

A key feature of traditional (Metropolis-like) Monte Carlo algorithms
is that the updates are local, since the procedure to update a given
spin depends only on the values of spins which affect its
contribution to the action, and most spin models have local (usually
nearest neighbor) interactions. Thus in a single step of the
algorithm, information about the state of a spin is transmitted only
to its near neighbors. In order for the system to reach a new
effectively independent configuration, this information must travel a
distance of order the (static or spatial) correlation length $\xi$.
As the information executes a random walk around the lattice, one
would suppose that the autocorrelation time $\tau \sim \xi^2$.
However, in general $\tau \sim \xi^z$, where z is called the
dynamical critical exponent. Almost all numerical simulations of spin
models have measured $z \approx 2$ for local update algorithms.

For a spin model with a phase transition, as the inverse temperature
$\beta$ approaches its critical value, $\xi$ diverges to infinity, so
that the computational efficiency rapidly goes to zero! This behavior
is called critical slowing down (CSD), and until very recently it has
plagued Monte Carlo simulations of statistical mechanical systems, in
particular spin models, at or near their phase transitions. Recently,
however, several new ‘cluster algorithms’ have been introduced which
decrease z dramatically by performing non-local spin updates, thus
reducing (or even eliminating) CSD and facilitating much more
efficient computer simulations.

Parallel computers have been very successfully applied to the Monte
Carlo simulation of spin models using the traditional algorithms such
as that of Metropolis. These algorithms are easily and efficiently
parallelized using domain decomposition of the lattice, since they
are very regular (and hence perfectly load balanced), and only
require a small amount of local communication between processors. The
new cluster algorithms, on the other hand, are highly irregular and
non-local, and are therefore much more difficult to parallelize
efficiently. Here we introduce the cluster update algorithms, explain
some sequential algorithms for identifying and labeling connected
clusters of spins, and then outline some parallel algorithms which
have been implemented on MIMD machines.

2.  Cluster  algorithms 

The aim of the cluster update algorithms is to find a suitable
collection of spins which can be flipped with relatively little cost
in energy. We could obtain non-local updating very simply by using
the standard Metropolis Monte Carlo algorithm to flip randomly
selected bunches of spins, but then the acceptance would be tiny. We
need a method which picks sensible bunches or clusters of spins to be
updated. The first such algorithm was proposed by Swendsen and Wang
[2], and was based on an equivalence between a Potts spin model [3]
and percolation models [4], for which cluster properties play a
fundamental role.

The Potts model is a very simple spin model of a ferromagnet, in
which the spins can take q different values. The case q = 2 is just
the well-known Ising model. In the Swendsen and Wang algorithm,
clusters of spins are created by introducing bonds between
neighboring spins with probability $1 - e^{-\beta}$ if the two spins
are the same, and zero if they are not. All such clusters are
created, and then updated by choosing a random new spin value for
each cluster and assigning it to all the spins in that cluster.
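
The bond-creation and cluster-update steps just described can be sketched as follows. This is only a minimal illustration, not the authors' code: it assumes a two-dimensional L x L lattice with periodic boundaries, and the names `make_bonds`, `update_clusters`, `right` and `down` are hypothetical. The cluster labels passed to `update_clusters` are assumed to come from any cluster identification routine.

```python
import math
import random


def make_bonds(spins, beta, rng):
    """Create Swendsen-Wang bonds on an L x L periodic lattice.

    right[y][x] is True if site (x, y) is bonded to (x+1, y);
    down[y][x] is True if site (x, y) is bonded to (x, y+1).
    """
    L = len(spins)
    p = 1.0 - math.exp(-beta)  # bond probability for equal neighbors
    right = [[False] * L for _ in range(L)]
    down = [[False] * L for _ in range(L)]
    for y in range(L):
        for x in range(L):
            s = spins[y][x]
            # Unequal neighbors are never bonded (probability zero).
            if spins[y][(x + 1) % L] == s and rng.random() < p:
                right[y][x] = True
            if spins[(y + 1) % L][x] == s and rng.random() < p:
                down[y][x] = True
    return right, down


def update_clusters(spins, labels, q, rng):
    """Give every cluster one random new spin value in 0..q-1 (in place)."""
    new_value = {}
    for y, row in enumerate(labels):
        for x, lab in enumerate(row):
            if lab not in new_value:
                new_value[lab] = rng.randrange(q)
            spins[y][x] = new_value[lab]
```

Passing the random number generator explicitly (for example `random.Random(seed)`) keeps runs reproducible, which matters when independent simulations are later compared.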

A slightly different cluster algorithm has been proposed by Wolff
[5]. In this algorithm, a spin is chosen at random and a single
cluster constructed around it, using the same bond probabilities as
for the Swendsen-Wang algorithm. All the spins in this cluster are
then flipped (i.e. changed to a random new spin different from the
old one). Although Wolff’s method is probably the best sequential
cluster algorithm, the Swendsen and Wang algorithm seems to be better
suited for parallelization, since it involves the entire lattice
rather than just a single cluster. We have therefore concentrated our
attention on the method of Swendsen and Wang, where all the clusters
must be identified and labeled.

3.  Cluster  identification 

Cluster algorithms have in common the problem of identifying and
labeling the connected clusters of spins. This is very similar to an
important problem in image processing, that of identifying and
labeling the connected components in a binary or multi-colored image
composed of an array of pixels. The only real difference is that in
the spin model case, neighboring sites of the same spin have a
certain probability of being in the same cluster, while for
neighboring pixels of the same color that probability is one.
Unfortunately this is a large enough difference so that some
algorithms which work in image analysis will not work (or require
substantial changes) for spin models.

First we mention some sequential algorithms for labeling clusters of
connected sites. Perhaps the most obvious method for identifying a
single cluster is the so-called ‘ants in the labyrinth’ algorithm.
The reason for its name is that we can visualize the algorithm as
follows [6]. An ant is put somewhere on the lattice, and notes which
of the neighboring sites are connected to the site it is on. At the
next time-step this ant places children on each of these connected
sites which are not already occupied. The children then proceed to
reproduce likewise until the entire cluster is populated. In order to
label all the clusters, we start by giving every site a negative
label, set the initial cluster label to be zero, and then loop
through all the sites in turn. If a site’s label is negative then the
site has not already been assigned to a cluster, so we place an ant
on this site, give it the current cluster label, and let it
reproduce, passing the label on to all its offspring. When this
cluster is identified we increment the cluster label and carry on,
repeating the ant-colony birth, growth and death cycle until all the
clusters have been identified.
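
The labeling loop above amounts to a breadth-first search from every unlabeled site. The following is a minimal sketch under stated assumptions, not the authors' implementation: the lattice is L x L with periodic boundaries, and the bonds are held in two hypothetical boolean arrays, `right[y][x]` (bond from (x, y) to (x+1, y)) and `down[y][x]` (bond from (x, y) to (x, y+1)).

```python
from collections import deque


def label_clusters(right, down):
    """Label all bonded clusters; returns (labels, number_of_clusters)."""
    L = len(right)
    labels = [[-1] * L for _ in range(L)]  # negative = not yet assigned
    current = 0                            # initial cluster label is zero
    for y0 in range(L):
        for x0 in range(L):
            if labels[y0][x0] >= 0:
                continue                   # already belongs to a cluster
            labels[y0][x0] = current       # place the first ant
            ants = deque([(x0, y0)])
            while ants:
                x, y = ants.popleft()
                xl, yu = (x - 1) % L, (y - 1) % L
                neighbors = []
                if right[y][x]:
                    neighbors.append(((x + 1) % L, y))
                if down[y][x]:
                    neighbors.append((x, (y + 1) % L))
                if right[y][xl]:
                    neighbors.append((xl, y))
                if down[yu][x]:
                    neighbors.append((x, yu))
                for nx, ny in neighbors:
                    if labels[ny][nx] < 0:          # unoccupied site
                        labels[ny][nx] = current    # child inherits label
                        ants.append((nx, ny))
            current += 1                   # cluster finished, next label
    return labels, current
```

Each site is visited once and each bond is examined twice, so the work grows linearly with the number of sites.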

An alternative method which is commonly used (especially for cluster
identification in percolation models) is that of Hoshen and Kopelman
[7]. We have found that ‘ants’ gives slightly better performance than
this algorithm, and so we will not discuss it further.

Identifying and labeling clusters of connected sites in a lattice is
a special case of the more general problem known variously as the set
union, union-find, or equivalence problem, i.e. given a list of
equivalences between elements, sort the elements into equivalence
classes. In the context of cluster algorithms, the list of
equivalences is just a list of the sites which are connected
together, and the equivalence classes are just the clusters. There
are a multitude of algorithms for this problem [8]. We have used an
elegant and easy to code method due to Galler and Fisher [9], which
goes as follows. Let F(j) be the class or ‘family’ label of element
j. We start off with each element in its own family, so that
F(j) = j. The array F(j) can be interpreted as a tree structure,
where F(j) denotes the parent of j. If we arrange for each family to
be its own tree, disjoint from all other ‘family trees’, then we can
label each family by its most senior great-great-...grandparent.
Therefore we process each equivalence of two sites j and k by

(1) tracking j up to its highest ancestor,

(2) tracking k up to its highest ancestor,

(3) giving j to k as a new parent (where j and k now denote these
highest ancestors).

After processing all the equivalence relations, we go through all the
elements j and reset their F(j)’s to their highest possible
ancestors, which then label the equivalence classes, so that F(j) is
the cluster label of site j.
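
The three steps and the final sweep can be sketched in a few lines. This is an illustrative version only; the function name `equivalence_classes` and the representation of equivalences as a list of index pairs are assumptions made for the example.

```python
def equivalence_classes(n, equivalences):
    """Return F, where F[j] is the class label of element j (0..n-1)."""
    F = list(range(n))  # each element starts in its own family: F[j] = j

    def highest_ancestor(j):
        # Follow parent links until reaching an element that is its own parent.
        while F[j] != j:
            j = F[j]
        return j

    for j, k in equivalences:
        j = highest_ancestor(j)  # step (1)
        k = highest_ancestor(k)  # step (2)
        if j != k:
            F[k] = j             # step (3): j becomes the parent of k

    # Final sweep: point every element directly at its highest ancestor.
    for j in range(n):
        F[j] = highest_ancestor(j)
    return F
```

For cluster labeling, the elements are lattice sites (numbered 0 to N-1) and each bond between two sites contributes one pair to `equivalences`.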

4.  Parallel  algorithms 

As with the percolation models upon which the cluster algorithms are
based, the phase transition in a spin model occurs when the clusters
of bonded spins become large enough to span the entire lattice [4].
Thus near criticality (which in most cases is where we want to
perform the simulation) clusters come in all sizes, from order N
(where N is the number of sites in the lattice) right down to a
single site. The highly irregular and non-local nature of the
clusters means that cluster update algorithms do not vectorize, and
hence give poor performance on vector machines. On this problem a
CRAY X-MP is only about ten times faster than a SUN4 workstation.
SIMD machines are similarly unsuited to this problem, whereas for the
Metropolis type algorithms they are perhaps the best machines
available. It therefore appears that the optimum performance for this
type of algorithm will come from MIMD parallel computers.

Using the trivial parallelization technique of running independent
Monte Carlo simulations on different processors, it is possible to do
better than a CRAY on a typical MIMD parallel computer with only
about 20 nodes. This method works well until the lattice size gets
too big to fit into the memory of each node, and in fact we have used
this method to calculate the dynamical critical exponents of various
cluster algorithms [10]. However, in the case of the Potts model, for
example, only lattices of size less than about $300^2$ or $50^3$ will
fit into 1 Mbyte, and most other spin models are more complicated and
more memory intensive. We therefore need a parallel algorithm where a
large lattice can be distributed over many processors.


A parallel cluster algorithm involves distributing the lattice onto
an array of processors using the usual domain decomposition. Clearly
a sequential algorithm can be used to label the clusters on each
processor, but we need a procedure for converting these labels to
their correct global values. We need to be able to tell many
processors, which may be any distance apart, that some of their
clusters are actually the same. Thus we need to be able to agree on
which of the many different local labels for a given cluster should
be assigned to be the global cluster label, and to pass this label to
all the processors containing a part of that cluster. We will discuss
two methods for tackling this problem, ‘self-labeling’ and ‘global
equivalencing’, and briefly mention some other algorithms for
labeling clusters in parallel.

4.1.  Self-labeling 

We shall refer to this algorithm as ‘self-labeling’, since each site
figures out which cluster it is in by itself from local information.
We begin by assigning each site i a unique cluster label $S_i$. In
practice this is simply chosen as the position of that site in the
lattice. At each step of the algorithm, in parallel, every site looks
in turn at each of its neighbors in the positive directions. If it is
bonded to a neighboring site n which has a different cluster label
$S_n$, then both $S_i$ and $S_n$ are set to the minimum of the two.
This is continued until nothing changes, by which time all the
clusters will have been labeled with the minimum initial label of all
the sites in the cluster. Note that to check termination of the
algorithm involves each processor sending a termination flag
(finished or not finished) to every other processor after each step,
which can become very costly for a large processor array.
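
The self-labeling iteration can be emulated on a single processor as below. This sketch is an assumption-laden illustration rather than the parallel code: it sweeps the sites sequentially and in place, so it may converge in fewer sweeps than a truly parallel update, although the final labels are the same. The bond arrays `right` and `down` (bonds to the neighbors in the positive x and y directions on a periodic L x L lattice) are hypothetical names.

```python
def self_label(right, down):
    """Return (S, sweeps): S[y][x] is the minimum initial label in the cluster."""
    L = len(right)
    # Unique initial label: the position of the site in the lattice.
    S = [[y * L + x for x in range(L)] for y in range(L)]
    sweeps = 0
    changed = True
    while changed:               # stands in for the global termination flag
        changed = False
        sweeps += 1
        for y in range(L):
            for x in range(L):
                # Look at the neighbors in the positive directions only.
                for bonded, nx, ny in ((right[y][x], (x + 1) % L, y),
                                       (down[y][x], x, (y + 1) % L)):
                    if bonded and S[ny][nx] != S[y][x]:
                        m = min(S[y][x], S[ny][nx])
                        S[y][x] = m
                        S[ny][nx] = m
                        changed = True
    return S, sweeps
```

In a distributed version the `changed` flag is what every processor would have to exchange after each step, which is the termination cost noted above.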

We can improve this method by using a faster sequential algorithm,
such as ‘ants in the labyrinth’, to label the clusters in the
sublattice on each processor, and then just use self-labeling on the
sites at the edges of each processor to eventually arrive at the
global cluster labels. The number of steps required to do the
self-labeling will depend on the largest cluster, which at the phase
transition will generally span the entire lattice. The number of
self-labeling steps will therefore be of the order of the maximum
distance between processors, which for a square array of P processors
is just $2\sqrt{P}$. Hence the amount of communication (and
calculation) involved in doing the self-labeling, which is
proportional to the number of iterations times the perimeter of the
sublattice, goes like L for an L x L lattice, whereas the time taken
on each processor to do the local cluster labeling goes like the area
of the sublattice, which is $L^2/P$. Therefore as long as L is
substantially greater than the number of processors we can expect to
obtain a reasonable speedup.
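
The scaling argument above can be written out explicitly. The following is a rough estimate only, assuming square sublattices of side $L/\sqrt{P}$ (so a perimeter of $4L/\sqrt{P}$) and ignoring all machine-dependent constants; the symbols $T_{\text{comm}}$ and $T_{\text{local}}$ are introduced here for illustration.

```latex
\begin{align*}
T_{\text{comm}} &\propto (\text{iterations}) \times (\text{perimeter})
  \approx 2\sqrt{P} \cdot \frac{4L}{\sqrt{P}} = 8L, \\
T_{\text{local}} &\propto (\text{area}) = \frac{L^2}{P}, \\
\frac{T_{\text{comm}}}{T_{\text{local}}} &\sim \frac{8P}{L}.
\end{align*}
```

The ratio is small, and the speedup correspondingly good, only when L is much larger than P, which is the condition stated in the text.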

The speedups obtained on the Symult S2010 for a variety of lattice
sizes are shown in Fig. 1. The dashed line indicates perfect speedup
(i.e. 100% efficiency). The lattice sizes for which we actually need
large numbers of processors are of the order of $512^2$ or greater,
and we can see that running on 64 nodes (or running multiple
simulations of 64 nodes each) gives us quite acceptable efficiencies
of about 70% for $512^2$ and 80% for $1024^2$. In Table 1 we show a
comparison of times for one update of a $512^2$ lattice using
self-labeling on various MIMD parallel computers, and compare this
with results for the fastest algorithm on a SUN workstation and a
CRAY X-MP. The time on the CRAY is taken from Wolff [11]. Note that
using all 512 nodes of Caltech’s NCUBE, by running multiple 64 node
simulations, gives a performance approximately five times that of the
CRAY, while all 192 nodes of the Symult S2010 are equivalent to about
six CRAYs for this problem.


Machine       Nodes   Time (sec)
SUN-4           1       16.0
CRAY X-MP       1        1.7
NCUBE/1        64        2.8
Symult         64        0.82
Meiko          32        1.2

Table 1. Times for one update of a $512^2$ lattice using the Swendsen
and Wang cluster algorithm. Self-labeling is used on the parallel
machines.
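
The "CRAY equivalents" quoted in the text follow directly from the times in Table 1. The short calculation below is a sketch of that arithmetic; the function name `cray_equivalents` is hypothetical, and it assumes that independent 64-node simulations run concurrently with no interference between them.

```python
CRAY_TIME = 1.7  # seconds per update on the CRAY X-MP (Table 1)


def cray_equivalents(total_nodes, nodes_per_run, time_per_update):
    """Throughput of the whole machine relative to one CRAY X-MP."""
    copies = total_nodes // nodes_per_run   # independent simulations
    return copies * CRAY_TIME / time_per_update


ncube = cray_equivalents(512, 64, 2.8)    # 8 copies at 2.8 s each
symult = cray_equivalents(192, 64, 0.82)  # 3 copies at 0.82 s each
print(f"NCUBE: {ncube:.1f} CRAYs, Symult: {symult:.1f} CRAYs")
```

The results, about 4.9 and 6.2, agree with the "approximately five" and "about six" stated above.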

4.2.  Global  equivalencing 

In this method we again use the fastest sequential algorithm to
identify the clusters in the sublattice on every node. Each node then
checks to see which of the edge sites of its sublattice are connected
to edge sites on the neighboring nodes in the positive directions,
and are therefore part of the same cluster and should be given the
same cluster label. These lists of ‘equivalences’ are all passed to
one of the nodes, which uses the equivalence class algorithm of
Galler and Fisher [9] to match up the connected subclusters, and then
broadcasts the global cluster labels to all the other nodes.

The problem here is that the equivalencing is purely sequential, and
is thus a potentially disastrous bottleneck for large numbers of
processors. The amount of work involved goes like the number of
processors P times the perimeter of the sublattice on each node, so
that the efficiency should be less than for self-labeling, although
we might still expect reasonable speedups if the number of nodes is
not extremely large. The speedups obtained for this algorithm on the
Symult S2010 for a variety of lattice sizes are shown in Fig. 2.
Global equivalencing gives about the same speedups as self-labeling
for small numbers of processors, but as expected self-labeling does
much better as the number of nodes increases.

To get around this sequential bottleneck we need to adopt a
hierarchical divide-and-conquer approach, where the equivalence
classes are built up in stages. In this approach the processor array
is divided up into smaller subarrays of, for example, 2x2 processors.
In each subarray, the processors look at the edges of their neighbors
for clusters which are connected across processor boundaries. These
equivalences are all passed to one of the nodes of the subarray,
which places the cluster labels in equivalence classes as before. The
results of these partial matchings are similarly combined across the
edges of each 4x4 subarray, and this process is continued until
finally all the partial results are merged together on a single
processor to give the global cluster values, which are then passed
back through the hierarchy of levels. This type of algorithm has been
implemented on a hypercube for the image processing component
labeling problem by Embrechts et al. [12]. We are currently
implementing the hierarchical global equivalencing algorithm for the
spin model case, which should do better than self-labeling for large
numbers of processors, since the number of steps required goes like
the logarithm (rather than the square root) of the number of
processors.
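
To make the logarithm versus square root comparison concrete, the sketch below counts steps for a square processor array. It is a simplified model with assumed names, not a measurement: self-labeling is taken to need about $2\sqrt{P}$ steps as estimated in Section 4.1, and the hierarchical scheme is taken to double the subarray side at each stage (2x2, 4x4, ...) until it covers the whole array.

```python
import math


def self_labeling_steps(P):
    """Approximate step count for self-labeling: 2 * sqrt(P)."""
    return 2 * math.isqrt(P)


def hierarchical_stages(P):
    """Merge stages needed when the subarray side doubles each stage."""
    side = math.isqrt(P)  # processors along one edge of the array
    stages = 0
    size = 1
    while size < side:
        size *= 2         # 2x2, then 4x4, then 8x8, ...
        stages += 1
    return stages


for P in (16, 64, 256, 1024):
    print(P, self_labeling_steps(P), hierarchical_stages(P))
```

For 64 processors this gives 16 self-labeling steps against 3 merge stages, although each merge stage handles longer boundaries than a single self-labeling step, so the step count alone does not determine the run time.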

4.3. Other algorithms

Currently the only other parallel cluster algorithm implemented for
spin models is a parallel extension of the Hoshen and Kopelman
algorithm [7] due to Burkitt and Heermann [13], but it is much more
complicated, and less efficient, than the self-labeling algorithm.
There have been many different parallel algorithms proposed for the
connected component labeling problem in image analysis. Some of these
algorithms are aimed at shared memory [14] or SIMD [15] [16]
architectures, but could probably be implemented on distributed
memory MIMD machines also. Others are based on MIMD machines such as
hypercubes [17]. These algorithms also need investigation to see if
they might be applied to the problem of producing a more efficient
parallel cluster algorithm for spin models on large numbers of
processors.
Fig. 1. Speedups for self-labeling on the Symult S2010.

Fig. 2. Speedups for global equivalencing on the Symult S2010.

Horizontal axis: number of nodes.
Abstract

A large scale Quantum Monte Carlo simulation is performed on the Mark
IIIfp Hypercube supercomputer to systematically study the quantum
spin dynamics of the recently discovered high-$T_c$ superconducting
mother material. The algorithm is very efficiently implemented on the
Hypercube. The 3-dimensional lattice is partitioned into a ring of
processor nodes. Parallelism is also achieved by running several
independent simulations on several processor rings simultaneously.
The local updates are easily handled by the CrOS communication
system. Global updates are efficiently implemented by
“gather-scatter” routines written in CrOS calls. Spins are packed
into 32-bit words along the time direction. Local updates are
vectorized along the time direction. We also report a systematic
performance analysis. The efficiency of the implementation is over
90%.

1. Introduction

The power of parallel computers has been doubling every six months in
recent years. Significant computations [1] in scientific and
engineering research are performed on these parallel computers. Many
of the applications [2] easily achieved performance better (in some
cases, much better) than the conventional supercomputers and obtained
new, important results. In this paper, we report such an application
on the Mark IIIfp hypercube [3,4].

The discovery of high-$T_c$ superconductors [5] has led to an
enormous amount of research on two-dimensional magnetism. Magnetic
properties are believed to play a significant role in the new
mechanism for high-$T_c$ superconductivity. This magnetism is
essentially modeled by the quantum antiferromagnetic Heisenberg
model:

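$$ H = J \sum_{\langle ij \rangle} \left( S^x_i S^x_j + S^y_i S^y_j + S^z_i S^z_j \right) \qquad (1) $$

where $S^a = (1/2)\sigma^a$ are quantum spin operators:

$$ \sigma^x = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad
\sigma^y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \quad
\sigma^z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}. $$
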
This model is also a limiting case of a much more general theory. We
have performed a series of large scale quantum Monte Carlo
simulations on these models and obtained significant results. We only
mention here that our data agree with the neutron scattering
experiments very well and this calculation is the first to give an
accurate first principles determination of the exchange coupling
$J = 1450 \pm 30$ K and spin stiffness constant $\rho = 0.199(2)$.
Fig. 1 shows the comparison between our data and the experiments.
More details of the physics results are reported in [6]. On a related
quantum XY model, we found [7] a Kosterlitz-Thouless phase
transition, thus completing more than 20 years of investigation into
an important problem in statistical physics.

Fig. 1

In this paper, we will concentrate on the computational aspects of
the simulation. We first outline the algorithm. The multi-coding
technique is explained next. Parallel implementation is then
discussed in detail. We give a systematic performance analysis.

2. Converting Quantum Problem to Classical Problem

The Monte Carlo algorithm we used is a fairly standard one in
statistical physics, although the details are quite complex due to
the quantum conservation laws. Following the Suzuki-Trotter approach
we first convert the quantum problem into a classical one. The
partition function for the Heisenberg model Eq. 1 on the
2-dimensional lattice can be rewritten as:

$$ Z = \mathrm{Tr}\, e^{-\beta H} = \lim_{m \to \infty} \mathrm{Tr}
\left( e^{-\beta H_1/m} e^{-\beta H_2/m} e^{-\beta H_3/m} e^{-\beta H_4/m} \right)^m $$

where $\beta = 1/kT$ and

$$ H = H_1 + H_2 + H_3 + H_4 $$

is a breakup so that each $H_i$ contains only terms commuting among
themselves. The integer, m, is set to be a large but finite number,
in practice. After inserting complete sets of states (eigenstates of
$S^z$), the partition function breaks down into products of Boltzmann
factors associated with interacting 4-spin squares:

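$$
\begin{aligned}
Z = \lim_{m \to \infty} \sum_{\{C\}}
 & \langle 1,1|e^{-\beta H_1/m}|1,2\rangle \langle 1,2|e^{-\beta H_2/m}|1,3\rangle \\
 & \langle 1,3|e^{-\beta H_3/m}|1,4\rangle \langle 1,4|e^{-\beta H_4/m}|2,1\rangle \cdots \\
 & \langle m,1|e^{-\beta H_1/m}|m,2\rangle \langle m,2|e^{-\beta H_2/m}|m,3\rangle \\
 & \langle m,3|e^{-\beta H_3/m}|m,4\rangle \langle m,4|e^{-\beta H_4/m}|1,1\rangle
\end{aligned}
$$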

This becomes a general classical Ising spin system in 3 dimensions.
The Boltzmann weight associated with a 4-spin square configuration is
given by the following transfer matrix;

more  explicitly, 

0  0  0 

0  e-^ch(2K)  e-^sh{2K)  0 

0  e-^shi2K)  e-^ch\2K)  0 

0  0  0 

where $K = \beta J/4m$. The zero elements in the transfer matrix are
the consequence of the quantum conservation law. To avoid generating
trial configurations with zero transfer probability, which would
waste CPU time because these trials will never be accepted, one
should have the conservation law built into the flipping scheme. We
have designed a set of four elementary updates [6] that can generate
all possible spin configurations.
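
The origin of the zero elements can be checked numerically. The sketch below is an independent illustration, not the paper's code: it assumes the two-spin term $J\,\mathbf{S}_i \cdot \mathbf{S}_j$ of Eq. 1 with $K = \beta J/4m$, works in the basis (up,up), (up,down), (down,up), (down,down), and exponentiates with a plain Taylor series. The signs of the off-diagonal elements depend on this convention and are not taken from the matrix printed in the text.

```python
import math


def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(A, terms=40):
    """Matrix exponential by Taylor series (adequate for small matrices)."""
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, A)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


# S_i . S_j for two spin-1/2 sites in the basis uu, ud, du, dd.
S_DOT_S = [[0.25, 0.0, 0.0, 0.0],
           [0.0, -0.25, 0.5, 0.0],
           [0.0, 0.5, -0.25, 0.0],
           [0.0, 0.0, 0.0, 0.25]]


def transfer_matrix(K):
    """exp(-(beta*J/m) S_i.S_j), using beta*J/m = 4K."""
    return mat_exp([[-4.0 * K * v for v in row] for row in S_DOT_S])


T = transfer_matrix(0.1)
for row in T:
    print(["%.4f" % v for v in row])
```

Every element connecting states with different total $S^z$ comes out exactly zero, which is the conservation law that the flipping scheme has to respect.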

3.  The  Computational  Algorithm 


This classical spin system in 3 dimensions is simulated using the
Metropolis Monte Carlo algorithm. Starting with a given initial
configuration, we locate a closed loop C of L spins. After checking
that they satisfy the conservation law if we flip them all together,
we compute the probability that the present configuration remains
unchanged:

$$ P_i = \prod_{k=1}^{L} W_{C_k, C_k} $$

where $W_{C_k, C_k}$ are diagonal elements of the transfer matrix,
and the probability that the configuration is flipped:

$$ P_f = \prod_{k=1}^{L} W'_{C_k, C_k} $$

where $W'_{C_k, C_k}$ are diagonal elements of the transfer matrix
along the loop C. The Metropolis procedure is to accept the flip
according to the probability

$$ P = P_f / P_i. $$

If the flip is not accepted, we keep the initial configuration and go
on to the next loop of spins.
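
The acceptance step can be sketched as follows. This is a generic Metropolis test, given here only as an illustration: the function name `accept_flip` is hypothetical, and the two weights are assumed to have been computed already as the products of transfer matrix elements before and after the flip.

```python
import random


def accept_flip(p_initial, p_final, rng):
    """Accept the flip with probability min(1, p_final / p_initial)."""
    if p_initial <= 0.0:
        raise ValueError("the current configuration must have nonzero weight")
    ratio = p_final / p_initial
    # A ratio of 1 or more is always accepted; otherwise compare with a
    # uniform random number in [0, 1).
    return ratio >= 1.0 or rng.random() < ratio


rng = random.Random(12345)
accepted = sum(accept_flip(1.0, 0.25, rng) for _ in range(10000))
print(accepted / 10000)  # fraction accepted, close to 0.25
```

A trial configuration with zero weight is never accepted by this test, which is why the text recommends building the conservation law into the moves rather than generating such trials.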

The classical system is defined on a 3-dimensional lattice. On each
grid point, a spin can only have two states, up or down, which are
represented by 0 or 1. An elementary square of 4 spins is called an
interacting square if they are connected through the Boltzmann
factor. They are denoted by the shaded squares in Fig. 2. Note that
not all squares are interacting.

Fig. 2

Two types of local moves may locally change the spin configurations.
The time-loop local update is shown in Fig. 3. (Note that the spins
at the lattice sites in Figs. 3-5 are omitted for simplicity. They
are present as in Fig. 2.) All 8 spins in the loop are either all
flipped or remain unchanged, depending on whether the probability
test succeeds or fails. Similarly in the space-loop local update,
shown in Fig. 4, all 4 spins in the loop are either all flipped or
not.

A global move in the time direction, which we call a time-line, flips
all the spins along this time-line. This update changes the
magnetization. Another global move in the spatial directions, the
winding-line, shown in Fig. 2 by the double line, changes the winding
numbers.

Fig. 3

Fig. 4

Periodic boundary conditions are imposed in all directions to
preserve the translation invariance and to satisfy the trace
requirement. In each Monte Carlo sweep through the lattice, we apply
all four moves to all possible configurations. After enough sweeps,
the system reaches the equilibrium state, and we then take
measurements.

4. The Multispin Coding

We implemented a simple and efficient multispin coding method, which
facilitates vectorization and saves index calculation and memory
space. This is possible because each spin only has two states, up (1)
or down (0), which is represented by a single bit in a 32-bit
integer. Spins along the t-direction are packed into 32-bit words, so
that the boundary communication along the x or y direction can be
handled more easily.

All the necessary checks and updates can be handled by the bitwise
logical operations OR, AND, NOT, XOR. Note that this is a natural
vectorization since AND operations for the 32 spins are carried out
in the single AND operation by the CPU. The index calculations to
address these individual spins are also minimized, because one only
computes the index once for the 32 spins. The same principles are
applied for both local and global moves, but it is easier to
illustrate them for local moves, as shown in Fig. 5.
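
The basic idea of multispin coding can be illustrated with a few bit operations. The sketch below is a generic example under stated assumptions, not the paper's routines: Python integers are unbounded, so a 32-bit word is emulated with an explicit mask, and the helper names `pack`, `unpack`, `antiparallel` and `flip` are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # keep results within one 32-bit word


def pack(spins):
    """Pack up to 32 spins (1 = up, 0 = down) into one word, spin t in bit t."""
    word = 0
    for t, s in enumerate(spins):
        word |= (s & 1) << t
    return word


def unpack(word, n=32):
    """Recover the list of n spins from a packed word."""
    return [(word >> t) & 1 for t in range(n)]


def antiparallel(s1, s2):
    """Bit t is 1 where the spins of the two words differ (one XOR for all)."""
    return (s1 ^ s2) & MASK32


def flip(word, mask):
    """Flip exactly the spins selected by mask (one XOR for all)."""
    return (word ^ mask) & MASK32


s1 = pack([1, 0] * 16)
s2 = flip(s1, 0b1111)           # flip the four lowest time slices
print(unpack(antiparallel(s1, s2), 8))
```

A single XOR or AND therefore tests or updates all 32 time slices at once, and only one array index has to be computed per word.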

A pair of adjacent words contains eight “time”
loops, as indicated by the dotted line in Fig.5. Because
every two adjacent “time” loops share an interacting
square, we update all four odd “time” loops
simultaneously in a vectorized fashion. The other
four even “time” loops are updated next. Many of
the useful quantities obtained in updating the four
odd ones will also be used for the four even ones. We
now briefly illustrate the scheme. We want to update
the odd “time” loops 1, 3, 5 and 7 of the spin words
S1 and S2 in Fig.5. We first compute F = S1 [XOR]
S2, and then W = F [AND] MASK1, where MASK1
has “1”s located at the proper position of the “time”
loop: MASK1=(0...01111). The flip of “time” loop
1 is allowed if W [AND] MASK1 = MASK1 and (S1
[AND] MASK1) + (S2 [AND] MASK1) = 15 (which
means that all four spins in S1 must be down and
the four spins in S2 must be up, or vice versa). S1
is also XOR-ed with N1, N6 and N5 to obtain E1,
E6 and E5, the information needed to compute the
energy due to the three interacting loops on the S1 side
(see Fig.3). Similarly, S2 is XOR-ed with N2, N3 and
N4. Finally, S1 is XOR-ed with (S1 [right-shift] 1) to
obtain C, which contains the information about the
upper and lower interacting loops (which are shared
with adjacent “time” loops). After masking N1-N6
and C with appropriate masks, we SHIFT and OR them
together to obtain X1 and X2, which contain the
information about the eight interacting loops shared
by S1 and S2. Notice that N1-N6 are used only once
for all of the eight “time” loops.




Fig.5 

To retrieve the information that pertains to
“time” loop 1, we calculate I1 = X1 [AND] MASK1
and I2 = X2 [AND] MASK1. (I1,I2) is a pair of
small integers in one-to-one correspondence with the
spin configuration: it uniquely determines the transition
probability. Thus (I1,I2) is used as an index
to fetch the transition probability stored in a small
lookup table calculated at the beginning. The floating
point operations in the update are those of the Metropolis
accept/reject test. Upon acceptance, the proper four
spins in S1 are flipped by S1 = S1 [XOR] MASK1,
and similarly for S2. The update of “time” loop 3
proceeds in the same manner as for loop 1, after
left-shifting MASK1 by 8 bits, and similarly for loops
5 and 7. Once “time” loops 1, 3, 5 and 7 are completed,
we need to recalculate C only, and the entire process
is repeated for the even loops. Notice that the only
floating point operations in these updates are a random
number generation[9] and a comparison.
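
The masking and flipping steps can be sketched as follows. This is a minimal Python illustration, not the authors' C code: `accept_prob` is a hypothetical callable standing in for the lookup table indexed by (I1,I2), and the flip test spells out the condition stated in parentheses in the text (four spins all down in one word, all up in the other) rather than the XOR-and-sum form.

```python
import random

MASK1 = 0b1111   # selects the four bits of "time" loop 1
LOOP_SHIFT = 8   # odd loops 1, 3, 5, 7 sit 8 bits apart


def flip_allowed(s1, s2, mask):
    """True if the masked spins are all down in one word and all up in the other."""
    a, b = s1 & mask, s2 & mask
    return (a == 0 and b == mask) or (a == mask and b == 0)


def update_odd_loops(s1, s2, accept_prob, rng=random.random):
    """Try to flip the four odd "time" loops of the word pair (s1, s2).

    accept_prob(s1, s2, mask) is a placeholder for the table lookup; the only
    floating point work is one random number and one comparison per loop.
    """
    for k in range(4):
        mask = MASK1 << (LOOP_SHIFT * k)
        if flip_allowed(s1, s2, mask) and rng() < accept_prob(s1, s2, mask):
            s1 ^= mask   # flip the four spins of this loop in S1
            s2 ^= mask   # and the matching four spins in S2
    return s1, s2
```

For example, with an acceptance probability of 1 the pair (0x00F, 0xF00) becomes (0xF00, 0x00F): loops 1 and 3 are both flippable and both flip, while loops 5 and 7 fail the test and are left untouched.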

Four adjacent words contain eight “space” loops.
They can be updated without alternating even
and odd ones, since they are decoupled.

The global move in the time direction is very easy
to implement with this type of spin packing. One has
to check whether the bits are all either 0’s or 1’s, then
XOR the word to be flipped with its four neighboring
words to get the transition probability. The same
principles are used to implement the global flip in
the spatial directions, but the actual procedure is much
more complicated. It is desirable to have the simplest
possible spin interaction in order to minimize
the complexity of the various tests needed to determine
the transition probability. For this reason,
we believe that our “bond-type” decomposition is
preferable due to the simplicity of its spin interactions,
although the spin packing could be done with any
other decomposition, such as the “cell-type” breakup,
which leads to more complicated 8-spin interactions.

5. Parallel Implementation

With the multispin coding technique, the spins do
not occupy much memory, so the simplest
approach is to run an independent simulation on each
node and average them to get the statistical results.
However, this naive implementation does not work
well for this problem. Our lattices are fairly large
(128x128x192), in the sense that a typical thermal
relaxation takes about 10,000 sweeps (the details depend
on the temperature and the correlation length of
the system). Both thermal and quantum fluctuations
make the averaging process very long, in units
of 100 hours, to obtain a sufficiently complete sample.
We need to parallelize in physical space to cut
this CPU time down to a more reasonable value.

We partition the 3-dimensional lattice Nx*Ny*Nt
over a ring of M processor nodes so that each
node contains a subspace (Nx/M)*Ny*Nt, as shown in
Fig.6. The local updates are easily parallelized since
the connection is at most next-nearest neighbor (for
the time-loop update). The needed spin-word arrays
from a node's neighbors are copied into local
storage by the shift routine of the CROS communication
system[3,4] before doing the update. One of the
global updates, the time-line, can also be done in the
same fashion. The communication is very efficient in
the sense that a single communication shifts Ny*Nt
spins, instead of the Nt spins shifted when the lattice is
partitioned into a 2-dimensional grid. The overhead
associated with the communication routine, which
is far from negligible, is reduced greatly because it
occurs only once, instead of Ny times. This is one of
the reasons that a 1-d grid decomposition is better
than a 2-d decomposition for this class of problems.
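
The ring decomposition and the neighbor shift can be mimicked on a single machine. The sketch below is only a model of the data movement: it does not use CROS, the function names are invented for the illustration, and it exchanges a single boundary plane per side, whereas the time-loop update described above may need next-nearest planes as well.

```python
def partition_ring(x_planes, m):
    """Split a list of x-planes into m contiguous slabs, one per ring node."""
    if len(x_planes) % m:
        raise ValueError("Nx must be divisible by M")
    width = len(x_planes) // m
    return [x_planes[i * width:(i + 1) * width] for i in range(m)]


def shift_halos(slabs):
    """Return (left_halo, right_halo) for every node on a periodic ring.

    Each halo is one whole x-plane, i.e. Ny*Nt spins moved in one message.
    """
    m = len(slabs)
    halos = []
    for node in range(m):
        left = slabs[(node - 1) % m][-1]   # last plane of the left neighbor
        right = slabs[(node + 1) % m][0]   # first plane of the right neighbor
        halos.append((left, right))
    return halos
```

Here each "plane" can be any object (for instance a list of packed spin words); with 8 planes labelled 0 to 7 on 4 nodes, node 0 receives planes 7 and 2 as its halos.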


Fig.6 (processor nodes)

The winding-line global update along the x-direction
is difficult to do in this fashion, because it involves
spins on all the M nodes. In addition, we need
to compute the correlation functions, which pose the
same difficulty. However, since these operations are
not used very often (every 10 sweeps, one may call
the winding-line update and the correlation function
measurements), we devised a fairly elegant way to
parallelize these global operations.

We have written a set of gather-scatter routines
based on the cread and cwrite routines in CROS. In gather,
the subspaces on each node are gathered into complete
spaces on each node, preserving the original
geometric connection. As shown in Fig.7, each node
has a copy, but the x-coordinate is rotated accordingly
so that the node’s own subspace is in the starting
position. Parallelism is now achieved since the global
operations are done on each node just as on a sequential
computer, with each node doing only the part it
originally covers. In scatter, the updated (changed)
lattice configuration on a particular node (number
zero) is scattered (distributed) back to all the nodes
in the ring, exactly according to the original partition.
Note that this scheme differs from the earlier decomposition
scheme[8] for the gravitation problem, where
the memory size constraint is the main concern.
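
A single-process model of the gather and scatter steps may help fix the idea. It is a sketch under stated assumptions: the real routines are built on the CROS cread and cwrite calls, which are not modelled here, and the names `gather` and `scatter` below are simply borrowed from the text.

```python
def gather(slabs):
    """Give every node a full copy of the lattice.

    Each copy is rotated along x so that the node's own slab comes first,
    while the ring order of the slabs is preserved.
    """
    m = len(slabs)
    copies = []
    for node in range(m):
        rotated = []
        for step in range(m):
            rotated.extend(slabs[(node + step) % m])
        copies.append(rotated)
    return copies


def scatter(full_copy, m):
    """Cut node 0's (unrotated) copy back into the original m slabs."""
    width = len(full_copy) // m
    return [full_copy[i * width:(i + 1) * width] for i in range(m)]
```

Because node zero's copy is not rotated, scattering it reproduces the original partition exactly.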


Fig.7 

All of our simulations were done on a 32-node
hypercube. For higher temperatures, there is no need to
use lattices of size 96x96, or even 64x64. So
the 32 nodes were divided into several independent
rings, each ring holding an independent simulation, as
shown in Fig.6. Typically, for a 32x32 lattice, we run
8 simulations, each using a 4-node ring. For a 96x96
lattice, we run 2 independent simulations, each using a
16-node ring. This simple parallelism makes the simulation
very flexible and efficient. In the simulation, we used
a parallel version of the Fibonacci additive random
number generator[9], which has a period larger than

6. Performance Analysis

We have made a systematic performance analysis
by running the code on different lattice sizes and
different numbers of nodes. The timing results for a
realistic situation (20 sweeps of update, 1 measurement)
are given in Table 1. The speedup,

$S = t_1 / t_M$,

where $t_1$ ($t_M$) is the time for the same size spin
system to run the same number of operations on 1 (M)
nodes, is listed in Table 1. It is also plotted in
Fig.8. One can see that the speedup is quite close to the
ideal case denoted by the dashed line in Fig.8. For
the 128x128 quantum spin system, the 32-node hypercube
speeds up the computation by a factor of 26.6,
a very good result. However, running the same spin
system on 16 nodes is more efficient, because we can
run two independent systems on the 32-node hypercube
with a total speedup of 2x14.5=29 (each is sped up by
a factor of 14.5). This is better described by the efficiency,
defined as speedup/nodes, which is tabulated in Table 1.
Clearly, the efficiency of the implementation
is very high, generally over 90%.

Fig.8
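
The speedup and efficiency entries of Table 1 follow directly from the timings. The short Python sketch below recomputes them for the 128x128 lattice; the function name is chosen for this illustration only.

```python
def speedup_table(times):
    """Map node count -> (speedup, efficiency).

    times maps node count -> wall time in seconds and must include 1 node.
    Speedup is t_1 / t_M and efficiency is speedup / M.
    """
    t1 = times[1]
    return {m: (t1 / t, t1 / (t * m)) for m, t in sorted(times.items())}


# Timings for the 128x128 lattice, taken from Table 1.
timings_128 = {32: 20.7, 16: 38.1, 8: 74.1, 4: 145.4, 2: 298.0, 1: 551.0}

for nodes, (s, e) in speedup_table(timings_128).items():
    print(f"{nodes:2d} nodes: speedup {s:5.2f}, efficiency {e:.3f}")
```

The same function applied to the 16-node entry gives a speedup of about 14.5, so two independent 16-node runs deliver a combined speedup of about 29, which is the comparison made in the text.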

An exact, direct comparison with other supercomputers
is not available at present. However,
a very similar multispin code [10], which calculates
the elementary excitation energy spectrum of this
same Heisenberg model, is running on both the Mark IIIfp
and the Cray XMP. The Cray speed is approximately
equivalent to that of a 2-node Mark IIIfp. This indicates
that our 32-node Mark IIIfp outperforms the Cray
XMP by a factor of about (32/2)*90% = 14! We note
that our code is written in “C” and the vectorization
is limited to the 32 bits inside the words. By rewriting
the code in Fortran (Fortran compilers on the Cray are
more efficient) and fully vectorizing the code, one may
gain a factor of about 5 on the Cray. But even after such
a big programming effort on the Cray, the Mark IIIfp
will probably run faster than the Cray XMP by a factor
of 3. Clearly this quantum Monte Carlo code is a
good example of how parallel computers easily (i.e.,
at the same programming level) outperform
conventional supercomputers.

7. Conclusion and Acknowledgements

An implementation of the quantum Monte
Carlo code for the spin system on the hypercube
has been described in detail. The multispin coding
technique is an efficient, vectorized and memory-saving
scheme. Ring decomposition and the gather-scatter
routines make the parallel version flexible and
efficient. The performance is very good: the efficiency
is over 90%. This parallel technique can be applied to
a general class of problems that are not tightly
constrained by memory. This parallel code on the
Mark IIIfp outperforms conventional supercomputers.

We thank M. Cross, G. Fox and P. Weichman
for valuable discussions. This work is supported by
DOE DE-FG03-85ER25009 and NSF DMR-8715474.
MSM thanks the Shell Foundation for a fellowship.


Table 1. Performance of the Mark IIIfp for the quantum spin
program: the timing (in seconds) for 20 update sweeps
and 1 measurement, the speedup and the efficiency.

Size / nodes             32     16      8      4      2      1
128x128  time          20.7   38.1   74.1  145.4    298    551
128x128  speedup       26.6   14.5   7.44   3.79   1.85      1
128x128  efficiency    .832   .904   .930   .948   .925      1
96x96    time             -   21.3   41.3   80.2    160    310
96x96    speedup          -   14.5   7.50   3.86   1.94      1
96x96    efficiency       -   .909   .937   .965   .968      1
64x64    time             -   9.86   18.4   35.5   69.8    139
64x64    speedup          -   14.1   7.55   3.91   1.99      1
64x64    efficiency       -   .881   .944   .979   .996      1

Abstract

Numerical simulations of Lattice QCD have been
performed on practically every computer since its
inception almost twenty years ago. Lattice QCD
is an ideal problem for parallel machines, as it can
be easily domain decomposed. In fact, the urge to
simulate QCD has led to the development of several
home-grown parallel “QCD machines”, in particular
the Caltech Cosmic Cube, the Columbia Machine,
IBM’s GF11, APE in Rome and the Fermilab Machine.
These machines were built because, at the
time, there were no commercial parallel computers
fast enough. Today, however, the situation has
changed with the advent of computers like the
Connection Machine 2 and the Ncube 2. Herein, I shall
explain why Lattice QCD is such a parallel problem
and compare two large-scale simulations of it:
one on the commercial Connection Machine and the
other on the latest Caltech/JPL hypercube.

1.  Introduction 

Quantum Chromo-dynamics (QCD) simulations
are consuming vast amounts of computer time these
days, and promise to do so for at least the foreseeable
future. The background for these calculations
is two decades of great progress in our understanding
of the basic particles and forces. Over time,
the particle physics community has developed an
elegant and satisfying theory which is believed to
describe all the particles and forces which can be
produced in today’s high energy accelerators. The
basic components of the so-called “Standard Model”
are matter particles (quarks and leptons), and the
forces through which they interact (electromagnetic,
weak and strong). The electromagnetic force is the
most familiar, and also the first to be understood
in detail. The weak force is less familiar, but manifests
itself in processes such as nuclear beta-decay,
for example. This piece of the Standard Model is
now called the electroweak sector. The third part
of the Standard Model is QCD, the theory of
the strong force, which binds quarks together into
“hadrons”, such as protons, neutrons, pions, and
a host of other particles. The strong force is also
responsible for the fact that protons and neutrons
bind together to form the atomic nucleus. Currently
we know of five types of quark (referred to as
“flavors”): up, down, strange, charm and bottom; and
expect at least one more (top) to show up soon.
In addition to having a “flavor”, quarks can carry
one of three possible charges known as “color” (this
has nothing to do with color in the macroscopic
world!), hence Quantum Chromo-dynamics. The
strong “color” force is mediated by particles called
gluons, just as photons mediate light in electromagnetism.
Unlike photons, though, gluons themselves
carry a “color” charge and therefore interact with
one another. This means that QCD is an extremely
nonlinear theory which cannot be solved analytically.
Hence we resort to numerical simulations.

2.  Lattice  QCD 

To put QCD on a computer we proceed as follows.
The four-dimensional space-time continuum
is replaced by a four-dimensional hypercubic periodic
lattice, of size $N = N_s \times N_s \times N_s \times N_t$, with the
quarks living on the sites and the gluons living on
the links of the lattice. $N_s$ is the spatial and $N_t$ is
the temporal extent of the lattice. The gluons are
represented by 3x3 complex SU(3) matrices associated
with each link in the lattice. This link matrix
describes how the “color” of a quark changes as it
moves from one site to the next. The action functional
for the purely gluonic part of QCD is

$S_G = \beta \sum_P \left( 1 - \frac{1}{3} \mathrm{Re}\, \mathrm{Tr}\, U_P \right)$,   (1)

where

$U_P = U_{i,\mu} \, U_{i+\hat\mu,\nu} \, U^\dagger_{i+\hat\nu,\mu} \, U^\dagger_{i,\nu}$   (2)

is the product of link matrices around an elementary
square or plaquette on the lattice (see Figure
1). Essentially all of the time in QCD simulations
of gluons only is spent multiplying these SU(3) matrices
together. The code for this, shown in the
Appendix, reveals that its main component is the
$a \times b + c$ kernel (which most supercomputers can
do very efficiently). The partition function for full
lattice QCD including quarks is

$Z = \int D\bar\psi \, D\psi \, DU \, \exp\left( -S_G - \bar\psi (\slashed{D} + m) \psi \right)$,   (3)

where $\slashed{D} + m$ is a large sparse matrix the size of
the lattice squared. Unfortunately, since the quark
variables $\psi$ are anticommuting Grassmann numbers,
there is no simple representation for them on the
computer. Instead they must be integrated out,
leaving a highly non-local fermion determinant:

$Z = \int DU \, \det(\slashed{D} + m) \, \exp(-S_G)$.   (4)

This is the basic integral one wants to evaluate
numerically.
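
To make the plaquette term of the gluonic action concrete, here is a small pure-Python sketch that multiplies four 3x3 complex link matrices around a square and evaluates one term of the sum in eq. (1). It is not the Appendix code; the helper names are invented for the illustration, and the inner sum of products is the $a \times b + c$ kernel referred to above.

```python
def matmul3(a, b):
    """Product of two 3x3 complex matrices (nested lists)."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def dagger(a):
    """Hermitian conjugate of a 3x3 complex matrix."""
    return [[a[j][i].conjugate() for j in range(3)] for i in range(3)]


def plaquette_action_term(u1, u2, u3, u4):
    """Return 1 - (1/3) Re Tr (u1 u2 u3^dagger u4^dagger) for one plaquette."""
    up = matmul3(matmul3(u1, u2), matmul3(dagger(u3), dagger(u4)))
    trace = up[0][0] + up[1][1] + up[2][2]
    return 1.0 - trace.real / 3.0


IDENTITY = [[complex(i == j) for j in range(3)] for i in range(3)]
```

With all four links equal to the identity the term vanishes, which is the minimum of the action; replacing one link by the SU(3) matrix diag(-1, -1, 1) raises the term to 4/3.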

Note that the lattice is a mathematical construct
used to solve the theory; at the end of the
day, the lattice spacing $a$ must be taken to zero to
get back to the continuum limit. The lattice spacing
itself does not show up explicitly in the partition
function $Z$ above. Instead the parameter $\beta = 6/g^2$,
which plays the role of an inverse temperature, ends
up controlling the lattice spacing $a(\beta)$. To take the
continuum limit $a \to 0$ of lattice QCD one tunes
$g \to 0$, or $\beta \to \infty$. Typical values used in simulations
these days range from $\beta = 5.3$ to $\beta = 6.0$. This
corresponds to $a \approx 0.1$ Fermi $= 10^{-16}$ meter. Thus
at current values of $\beta$ a lattice with $N_s = 20$ will
correspond to a physical box about 2 Fermi on an
edge, which is large enough to hold one proton without
crushing it too much in the finite volume. Still
the spacing $a = 0.1$ Fermi is not fine enough that we
are close to the continuum limit. One can estimate
that we still need to shrink the lattice spacing by
something like a factor of 4, leading to an increase
by a factor of $4^4$ in the number of points in the lattice
in order to keep the box the same physical volume.
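
The two estimates in this paragraph are simple arithmetic, written out here as a Python sketch with illustrative function names.

```python
def box_edge_fermi(n_sites, spacing_fermi):
    """Physical edge length of a box of n_sites lattice points."""
    return n_sites * spacing_fermi


def site_growth(shrink_factor, dims=4):
    """Growth in the number of lattice points when the spacing shrinks by
    shrink_factor at fixed physical volume in `dims` dimensions."""
    return shrink_factor ** dims
```

Twenty sites at 0.1 Fermi give the 2 Fermi box quoted above, and shrinking the spacing fourfold in four dimensions multiplies the number of sites by 256.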

The biggest stumbling block preventing a large
increase in the number of lattice points is the presence
of the determinant $\det(\slashed{D} + m)$ in the partition
function. Physically, this determinant arises
from closed quark loops. The simplest way to proceed
is to ignore these quark loops and work in the
so-called “quenched” or “pure gauge” approximation.
The quenched approximation assumes that the
whole effect of quarks on gluons can be absorbed in
a redefinition of the gluon interaction strength.
Operationally, one generates gluon field configurations
using only the pure gauge part of the action, and
then computes the observables of interest in those
backgrounds. For some quantities this may be a
reasonable approximation. It is certainly orders of
magnitude cheaper, and for this reason almost all
simulations to date have been done using it. To
investigate the fully realistic theory, though, one
has to go beyond the quenched approximation and
tackle the fermion determinant.

There have been many proposals for dealing
with the determinant. The first algorithms tried
to compute the change in the determinant when a
single link variable was updated. This turned out to
be prohibitively expensive. Today, the preferred approach
is the so-called “Hybrid Monte Carlo” algorithm
[1]. The basic idea is to invent some dynamics
for the variables in the system in order to evolve
the whole system forward in (simulation) time and
then do a Metropolis accept/reject for the entire
evolution on the basis of the total energy change.
The great advantage is that the whole system is updated
at one fell swoop. The disadvantage is that
if the dynamics is not correct then the acceptance
will be very small. Fortunately (and this is one of
very few fortuitous happenings where fermions are
concerned) good dynamics can be found: the Hybrid
algorithm [2]. This is a neat combination
of the deterministic microcanonical method [3] and
the stochastic Langevin method [4], which yields a
quickly-evolving, ergodic algorithm for both gauge
fields and fermions. The computational kernel of
this algorithm is the repeated solution of systems of
equations of the form

$(\slashed{D} + m) \phi = \eta$,   (5)

where $\phi$ and $\eta$ are vectors which live on the sites of
the lattice. To solve these equations one typically
uses conjugate gradient or one of its cousins, since
the fermion matrix $(\slashed{D} + m)$ is sparse. For more
details, see [5]. Such iterative matrix algorithms
have as their basic component the $a \times b + c$ kernel,
so again computers which do this efficiently will run
QCD both with and without fermions well.
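
A minimal conjugate gradient solver shows why the $a \times b + c$ kernel dominates. This is a sketch under stated assumptions: plain conjugate gradient needs a symmetric positive-definite operator, so the toy operator below (a periodic one-dimensional Laplacian plus a mass term) is only a stand-in for the fermion matrix, which in practice needs a variant of the method as the text notes. All names are chosen for this illustration.

```python
def conjugate_gradient(apply_a, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A, given only x -> A x."""
    x = [0.0] * len(b)
    r = list(b)                       # residual b - A x for x = 0
    p = list(r)                       # search direction
    rr = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rr ** 0.5 < tol:
            break
        ap = apply_a(p)
        alpha = rr / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]     # a*b + c updates
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = sum(ri * ri for ri in r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


def apply_toy_operator(v, m=0.5):
    """Sparse stand-in operator: (2 + m) v_i - v_(i-1) - v_(i+1), periodic."""
    n = len(v)
    return [(2.0 + m) * v[i] - v[i - 1] - v[(i + 1) % n] for i in range(n)]
```

Each iteration costs one sparse matrix-vector product plus a handful of vector updates of the multiply-add form, and the matrix itself is never stored.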

However one generates the gauge configurations
$U$, using the quenched approximation or not, one
then has to compute the observables of interest. For
observables involving quarks one runs into expressions
like $\langle \psi(x) \bar\psi(y) \rangle$ involving pairs of quark fields
at different points. Again because of the Grassmann
nature of fermion fields, one has to express this
quantity as

$\langle \psi(x) \bar\psi(y) \rangle = (\slashed{D} + m)^{-1}_{xy}$.   (6)

And again one computes as many columns of the
inverse as needed by solving systems of equations like
(5) above. For simulations of full QCD with quark
loops, this phase of the calculation is a small overhead,
while for quenched calculations it is the dominant
part. So whether quenched or not, most of
the computer time is spent in applying conjugate
gradient to solve large systems of linear equations.

3.  Home-grown  QCD  Machines 

Today the biggest resources of computer time
for research are the conventional supercomputers at
the NSF and DOE centers. The centers are continually
expanding their support for lattice gauge
theory, but it may not be long before they are overtaken
by several dedicated efforts involving concurrent
computers. It is a revealing fact that
the development of most high performance parallel
computers (the Caltech Cosmic Cube, the
Columbia Machine, IBM’s GF11, APE in Rome, the
Fermilab Machine) was actually motivated by the
desire to simulate lattice QCD.

Geoffrey Fox and Chuck Seitz at Caltech built
the first hypercube computer, the Cosmic Cube or
Mark I, in 1983 [6]. It had 64 nodes, each of which
was an Intel 8086/87 microprocessor with 128 KB of
memory, giving a total of about 2 Mflops (measured
for QCD). This was quickly upgraded to the Mark
II hypercube with faster chips, twice the memory
per node and twice the number of nodes in 1984 [7].
Now QCD is running at 600 Mflops sustained on
the latest Caltech hypercube: the 128-node Mark
IIIfp (built by JPL) [8]. Each node of the Mark
IIIfp hypercube contains two Motorola 68020
microprocessors, one for communication and the other
for calculation, with the latter supplemented by one
68881 coprocessor and a 32-bit Weitek floating point
processor.

Norman Christ and Tony Terrano at Columbia
built their first parallel computer for doing lattice
QCD calculations in 1984 [9]. It had 16 nodes,
each of which was an Intel 80286/87 microprocessor
plus a TRW 22-bit floating point processor with
1 MB of memory, giving a total peak performance
of 256 Mflops. This was improved in 1987 using
Weitek rather than TRW chips so that 64 nodes give
1 Gflops peak [10]. Very recently, Columbia have
finished building their third machine: a 256-node 16
Gflops lattice QCD computer [11].


Don Weingarten at IBM has been building the
GF11 since 1984; it is expected that he will start
running in production in 1990 [12]. The GF11 is an
SIMD machine comprising 576 Weitek floating point
processors, each performing at 20 Mflops to give the
total 11 Gflops peak implied by the name.

The APE (Array Processor with Emulator)
computer is basically a collection of 3081/E processors
(which were developed by CERN and SLAC
for use in high energy experimental physics) with
Weitek floating point processors attached [13].
However, these floating point processors are attached
in a special way: each node has four multipliers
and four adders in order to optimize complex
$a \times b + c$ calculations, which form the major component
of all lattice QCD programs. This means that
each node has a peak performance of 64 Mflops.
The first, small machine, Apetto, was completed
in 1986 and had 4 nodes yielding a peak performance
of 256 Mflops. Currently, they have a second
generation of this with 1 Gflops peak from 16 nodes.
By 1992, the APE collaboration hopes to have completed
the 100 Gflops 4096-node “Apecento” [14].

Not to be outdone, Fermilab is also using its
high energy experimental physics emulators in
constructing a lattice QCD machine for 1991 with 256
of them arranged as a $2^5$ hypercube of crates, with
8 nodes communicating through a crossbar in each
crate [15]. Altogether they expect to get 5 Gflops
peak performance.

These performance figures are summarized in
Table 1. The “real” performances are the actual
performances obtained on QCD codes; in Figure 2
we plot these as a function of the year the QCD
machines started to produce physics results. The
surprising fact is that the rate of increase is very
close to exponential, yielding a factor of ten every
two years! On the same plot we show our estimate
of the computer power needed to redo this year’s
quenched calculations on a $128^4$ lattice. This estimate
is also a function of time, due to algorithm
improvements. Extrapolating both lines, we see the
outlook for lattice QCD is rather bright. Reasonable
results for the “harder” physical observables should
be available within the quenched approximation in
the mid-90’s. With the same computer power we
will be able to redo today’s quenched calculations
using dynamical fermions (but still on today’s size of
lattice). This will tell us how reliable the quenched
approximation is. Finally, results for the full theory
with dynamical fermions on a $128^4$ lattice should follow
early in the next century (!), when computers
are two or three orders of magnitude more powerful
again.

Table 1

Peak and real performances in Mflops
of “homebrew” QCD machines

Computer        Year     Peak     Real
Caltech I       1983        3        2
Caltech II      1984        9        6
Caltech III     1989     2000      600
Columbia I      1984      256       20
Columbia II     1987     1000      200
Columbia III    1990    16000     6000
IBM GF11        1990    11000    10000*
APE I           1986      256       20
APE II          1988     1000      200
APE III         1992   100000    20000*
Fermilab        1991     5000     1200*

* All real figures are measured except these predicted
ones.

With this brief review in hand, we now turn to
a comparison of QCD running on one home-grown
computer, the Caltech/JPL Mark IIIfp hypercube,
with the commercially available TMC Connection
Machine 2.

4. QCD on the Caltech/JPL Mark IIIfp

Decomposing QCD onto a d-dimensional hypercube
distributed memory computer (with $2^d$
nodes) is particularly simple. One takes the $N = M 2^d$
lattice and splits it up into $2^d$ sublattices, each
containing $M$ sites, one of which is placed in each
node. Due to the locality of the action, eq. (2), it is
possible to assign the sublattices so that each node
needs only to communicate with others to which it
is directly connected in hardware. As a result of this
fact the characteristic timescale of the communication,
$t_{comm}$, is minimal and corresponds to roughly
the time taken to transfer a single SU(3) matrix
from one node to its neighbor. Conversely we can
characterize the calculational part of the algorithm
by a timescale, $t_{calc}$, which is roughly the time taken
to multiply together two SU(3) matrices. For all
hypercubes built without floating point accelerator
chips $t_{comm} \ll t_{calc}$ and hence QCD simulations
are extremely “efficient”, where efficiency is defined
by the relation

$\epsilon = T_1 / (k \, T_k)$,

where $T_k$ is the time taken for $k$ processors to perform
the given calculation. Typically such calculations
have efficiencies in the range $\epsilon > 0.90$, which
means they are ideally suited to this type of computation,
since doubling the number of processors
approximately halves the total computational time
required for solution. However, as we shall see,
the picture changes dramatically when fast floating
point chips are used; then $t_{comm} \approx t_{calc}$ and one
must take some care in coding to obtain maximum
performance.

QCD simulations have been done on all the Caltech
hypercubes, the most recent being a high statistics,
large lattice study of the string tension in pure
gauge QCD on the Mark IIIfp [16]. For this the
128-node hypercube performs at 0.6 Gflops. As each
node runs at 6 Mflops this corresponds to a speedup
of 100, and hence an efficiency of 78%. These figures
are for the most highly optimized code. The original
version of the code written in C ran on the Motorola
chips at 0.085 Mflops and on the Weitek chips at 1.3
Mflops. The communication time, which is roughly
the same for both, is less than a 2% overhead for
the former but nearly 30% for the latter. When
the computationally intensive parts of the calculation
are written in assembly code for the Weitek
this overhead becomes almost 50%. This 0.9 msec
of communication, shown in lines 2 and 3 in Table
2, is dominated by the hardware/software message
startup overhead (latency), because for the Mark IIIfp
the node to node communication time, $t_{comm}$, is
given by

$$t_{comm} \approx (150 + 2W)\ \mu\text{sec},$$

where $W$ is the number of words transmitted. To
speed up the communication we update all even (or
odd) links (8 in our case) in each node, allowing us
to transfer 8 matrix products at a time, instead of
just sending one in each message. This reduces the
0.9 msec by a factor of

$$\frac{8\,(150 + 18 \cdot 2)}{150 + 8 \cdot 18 \cdot 2} \approx 3.4$$

to 0.26 msec. On all hypercubes with fast floating
point chips - and on most hypercubes without
for less computationally intensive codes - such
vectorization of communication is often important. In
Figure 3, the speedups for many different total lattice
sizes are shown. For the largest lattice size,
the speedup is 100 on the 128-node hypercube. The speedup
is almost linear in the number of nodes. As the total
lattice volume increases, the speedup increases,
because the ratio of calculation to communication
increases. For more information on this performance
analysis, see [17].
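
The arithmetic behind the vectorized-communication estimate and the quoted efficiency can be checked with a short Python sketch. It assumes the per-message cost of (150 + 2W) microseconds given above and W = 18 words per SU(3) matrix, as in the ratio above; the function and constant names are illustrative and do not come from the original code.

```python
def t_comm_usec(words, startup_usec=150.0, per_word_usec=2.0):
    """Node-to-node message time in microseconds for `words` words."""
    return startup_usec + per_word_usec * words


WORDS_PER_MATRIX = 18  # words per SU(3) matrix, as used in the ratio in the text
LINKS = 8              # even (or odd) links updated together on each node

# Eight separate one-matrix messages versus one eight-matrix message.
separate = LINKS * t_comm_usec(WORDS_PER_MATRIX)
batched = t_comm_usec(LINKS * WORDS_PER_MATRIX)
reduction = separate / batched


def efficiency(speedup, processors):
    """Efficiency T_1 / (k T_k), written as speedup divided by k."""
    return speedup / processors


print(f"reduction factor: {reduction:.2f}")
print(f"0.9 msec becomes {0.9 / reduction:.2f} msec")
print(f"efficiency for speedup 100 on 128 nodes: {efficiency(100, 128):.0%}")
```

Under these assumptions the startup term dominates a single 18-word message, which is why batching eight matrices into one message recovers most of the communication time.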
5. QCD on the TMC Connection Machine 2

The Connection Machine is also very well suited
for large-scale simulations of QCD. The CM-2 is
a distributed-memory, Single-Instruction Multiple-Data
(SIMD) massively-parallel processor comprising
up to 65536 (64K) processors [18]. Each processor
consists of an arithmetic-logic unit (ALU),
8 or 32 Kbytes of random-access memory (RAM)
and a router interface to perform communications
among the processors. There are sixteen processors
and a router per custom VLSI chip, with the chips
being interconnected as a twelve-dimensional hypercube.
Communications among processors within
a chip work essentially like a cross-bar interconnect.
The router can do general communications
but we require only local ones for QCD, so we use
the fast nearest-neighbor communication software
called NEWS. The processors deal with one bit at
a time; therefore the ALU can compute any two
boolean functions as output from three inputs, and
all data paths are 1-bit wide. In the current version
of the Connection Machine (the CM-2) groups of
32 processors (two chips) share a 32-bit (or 64-bit)
Weitek floating point chip, and a transposer chip
which changes 32 bits stored bit-serially within 32
processors into 32 32-bit words for the Weitek, and
vice versa.

The high-level languages on the CM, such as
*Lisp and CM-Fortran, compile into an assembly
language called Paris (Parallel Instruction Set).
Paris regards the 64K bit-serial processors as the
fundamental units in the machine, and so well represents
the global aspects of the hardware. However,
floating point computations are not very efficient in
the Paris model. This is because in Paris 32-bit
floating point numbers are stored “field-wise”, that
is, successive bits of the word are stored at successive
memory locations of each processor’s memory.
However, 32 processors share one Weitek chip which
deals with words stored “slice-wise”, that is, stored
across the processors, one bit in each. Therefore
to do a floating point operation, Paris loads in the
field-wise operands, transposes them slice-wise for
the Weitek (using the transposer chip), does the operation
and transposes the slice-wise result back to
field-wise for memory storage. Moreover, every operation
in Paris is an atomic process, that is, two
operands are brought from memory and one result
is stored back to memory, so no use is made of the
Weitek registers for intermediate results. Hence, to
improve the performance of the Weiteks, a new assembly
language called CMIS (CM Instruction Set)
has been written, which models the local architectural
features much better. In fact, CMIS ignores
the bit-serial processors and thinks of the machine in
terms of the Weitek chips. Thus data can be stored
slice-wise, eliminating all the transposing back and
forth. CMIS allows effective use of the Weitek registers,
creating a memory hierarchy, which combined
with the internal buses of the Weiteks offers increased
bandwidth for data motion.
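
The relation between the two layouts can be pictured as a 32 x 32 bit-matrix transpose. The following Python sketch only illustrates that idea and says nothing about the transposer chip's actual interface: `transpose_bits` and `word_at` are hypothetical helpers, and holding each processor's 32 memory bits in one Python integer is an assumption made for brevity.

```python
WIDTH = 32  # processors sharing one Weitek chip, and bits per word


def transpose_bits(rows, width=WIDTH):
    """Transpose a width x width bit matrix held as `width` integers.

    Bit m of rows[p] becomes bit p of entry m of the result.
    """
    out = [0] * width
    for p, value in enumerate(rows):
        for m in range(width):
            out[m] |= ((value >> m) & 1) << p
    return out


def word_at(slice_rows, address, width=WIDTH):
    """Read one word across the processors at a given memory address."""
    return sum(((slice_rows[j] >> address) & 1) << j for j in range(width))


# Field-wise: processor p holds all 32 bits of word p in its own memory.
field_wise = [(0x9E3779B9 * (p + 1)) & 0xFFFFFFFF for p in range(WIDTH)]

# Slice-wise: processor j holds bit j of every word, so memory address k,
# read across the 32 processors, yields word k.
slice_wise = transpose_bits(field_wise)

round_trip = transpose_bits(slice_wise)
print(all(word_at(slice_wise, k) == field_wise[k] for k in range(WIDTH)))
print(round_trip == field_wise)
```

Keeping data slice-wise throughout, as CMIS does, removes both transposes from every floating point operation.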

Currently, the Connection Machine is the most
powerful commercial QCD machine available: the
“Los Alamos collaboration” is running full QCD at a
sustained rate of almost 2 Gflops on a 64K CM-2 [19].
As was the case for the Mark IIIfp hypercube, in
order to obtain this performance one must resort to
writing assembly code for the Weitek chips and for
the communication. Our original code, written entirely
in *Lisp, achieved around 1 Gflops. As shown
in Table 3, this code spends 34% of its time doing
communication. When we rewrote the most computationally
intensive part in the assembly language CMIS,
this rose to 54%. In order to obtain maximum performance
we are now rewriting the communication
part of our code to make use of “multi-wire NEWS”,
which will allow us to communicate in all 8 directions
on the lattice simultaneously, thereby reducing
the communication time by a factor of 8 and speeding
up the code by another factor of 2.

6.  Conclusions 

It is interesting to note that when the various
groups began building their “homebrew” QCD machines
it was clear that they would out-perform all
commercial (traditional) supercomputers; however,
now that commercial parallel supercomputers have
come of age [20] the situation is not so obvious.

On the original versions of both commercial
and home-grown parallel computers (without fast
floating point chips) one could get good performance
from one’s favorite high-level language. Now, however,
as most of these machines do have fast floating
point hardware, one must resort to lower-level
assembly programming to obtain maximum performance.
Having done just that, we are running QCD
at 0.6 Gflops on the Caltech/JPL Mark IIIfp hypercube
and at 1.65 Gflops on the TMC Connection
Machine 2.
Table 2

Link update time (msec) on Mark IIIfp node
for various levels of programming

Programming level                 Calc. time  Comm. time  Total time  Mflops
Motorola MC68020/68881 in C         52          0.86        53         0.085
Weitek XL all in C                   2.25       0.90         3.15      1.4
Weitek XL parts in Assembly          0.94       0.90         1.84      2.4
Weitek XL Assembly, vec. comm.       0.94       0.26         1.20      3.8
Weitek XL Assembly, no comm.         0.94       0.0          0.94      4.8
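
The overhead percentages quoted in the text follow from the rows of Table 2. The Python sketch below recomputes the communication share of each row and, as a consistency check, the rate implied by the total time. The flop count per link update is not given in the paper; it is inferred here from the last row, and is treated as the same for every row, which is an assumption.

```python
# (programming level, calc msec, comm msec, total msec, Mflops) from Table 2
TABLE_2 = [
    ("Motorola MC68020/68881 in C", 52.0, 0.86, 53.0, 0.085),
    ("Weitek XL all in C", 2.25, 0.90, 3.15, 1.4),
    ("Weitek XL parts in Assembly", 0.94, 0.90, 1.84, 2.4),
    ("Weitek XL Assembly, vec. comm.", 0.94, 0.26, 1.20, 3.8),
    ("Weitek XL Assembly, no comm.", 0.94, 0.0, 0.94, 4.8),
]

# Assumption: work per link update in kflop, inferred from the last row
# (Mflops times msec gives kflop).
KFLOP_PER_UPDATE = 4.8 * 0.94


def comm_fraction(comm_msec, total_msec):
    """Share of the link update time spent communicating."""
    return comm_msec / total_msec


def implied_mflops(total_msec):
    """Rate implied by a total time under the constant-work assumption."""
    return KFLOP_PER_UPDATE / total_msec


for name, calc, comm, total, mflops in TABLE_2:
    print(f"{name:32s} comm {comm_fraction(comm, total):6.1%}  "
          f"implied {implied_mflops(total):6.3f} Mflops (table: {mflops})")
```

The implied rates agree with the tabulated Mflops to within rounding, which suggests the Mflops column is simply a fixed amount of work divided by the total time.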

Table 3

Fermion update time (sec) on 64K Connection Machine
for various levels of programming

Programming level      Calc. time  Comm. time  Total time  Mflops
All in *Lisp              8.7         4.5        13.2       900
Inner loop in CMIS        3.3         3.9         7.2
Multi-wire CMIS (1)     < 3.3         0.5       < 3.8

(1) projected numbers
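
The communication shares quoted in Section 5 can be recomputed from Table 3. The sketch below also scales the *Lisp rate by the ratio of total times; that scaling assumes the flop count per fermion update is the same at every programming level, which the paper does not state, so the scaled rates are estimates rather than reported results.

```python
# (programming level, calc sec, comm sec, total sec) from Table 3
LISP = ("All in *Lisp", 8.7, 4.5, 13.2)
CMIS = ("Inner loop in CMIS", 3.3, 3.9, 7.2)
MULTIWIRE_TOTAL_BOUND = 3.8  # projected upper bound on the total time (sec)
LISP_MFLOPS = 900.0


def comm_share(row):
    """Fraction of the update time spent in communication."""
    return row[2] / row[3]


def scaled_mflops(total_sec):
    """Rate implied by a total time, ASSUMING the same work as the *Lisp run."""
    return LISP_MFLOPS * LISP[3] / total_sec


print(f"*Lisp comm share: {comm_share(LISP):.0%}")
print(f"CMIS comm share:  {comm_share(CMIS):.0%}")
print(f"CMIS scaled rate: {scaled_mflops(CMIS[3]):.0f} Mflops")
print(f"multi-wire scaled rate: above {scaled_mflops(MULTIWIRE_TOTAL_BOUND):.0f} Mflops")
```

Under this assumption the CMIS row corresponds to roughly 1650 Mflops, in line with the 1.65 Gflops quoted in the Conclusions.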


Fig.  1. 

Illustration  of  plaquette  calculation 
Abstract

A study is conducted of the finite element solution
of the partial differential equations governing
two-dimensional electromagnetic field scattering problems
on a SIMD computer. A nodal assembly technique is
introduced which maps a single node to a single processor.
The physical domain is first discretized in parallel
to yield the node locations of an O-grid mesh. Next, the
system of equations is assembled and then solved in parallel
using a conjugate gradient algorithm for complex-valued,
non-symmetric, non-positive definite systems.
Using this technique and Thinking Machines Corporation’s
Connection Machine-2 (CM-2), problems with
more than 250k nodes are solved.

Results of electromagnetic scattering, governed by
the 2-d scalar Helmholtz wave equations, are presented
for a variety of infinite cylinders and airfoil cross-sections.
Solutions are demonstrated for a wide range
of objects. A summary of performance data is given for
the set of test problems.

1  Introduction 

The finite element technique is a method which allows
for the approximate solution of partial differential equations
over some finite domain. Because partial differential
equations govern various physical phenomena, the
technique has applications in many disciplines. Here, a
study is conducted of the finite element solution of the
partial differential equations governing two-dimensional
electromagnetic field scattering problems on a SIMD
computer.

First, the weak form of the continuous governing
equations is given. Second, the mapping of the finite
element program onto Thinking Machines Corporation’s
Connection Machine using nodal assembly is
described. Third, results are presented for a variety of
scattering shapes. Lastly, conclusions are drawn and
future research discussed.

2  Problem  Formulation 

The equations of interest are the 2-d scalar and vector
Helmholtz wave equations [1]. The equations are applied
over an open region artificially truncated with an
absorbing boundary condition [2]. The scalar equation

$$\nabla \cdot \frac{1}{\mu_r} \nabla E_z + k_0^2 \epsilon_r E_z = 0 \qquad (1)$$

governs the transverse magnetic (TM) normal incident
case and the vector equation

$$\nabla \times \frac{1}{\epsilon_r} \nabla \times \mathbf{H} - k_0^2 \mu_r \mathbf{H} = 0 \qquad (2)$$

governs the TM oblique incident case. $\mathbf{H}$ represents
the unknown magnetic field and $E_z$ represents the
z-component of the unknown electric field. Each case
can be written

$$E_z = E_z^i + E_z^s \qquad (3)$$

$$\mathbf{H} = \mathbf{H}^i + \mathbf{H}^s \qquad (4)$$

where $E_z^i$ and $\mathbf{H}^i$ represent the known incident fields
while $E_z^s$ and $\mathbf{H}^s$ are the unknown scattered fields
(Figure 1).

Figure 1: Open region scattering problem.


Applying  the  Galerkin  technique,  the  scalar  equation 
is  written 


<xTEl  -  0 


dTdE: 


0 


dn 


dr 


d<t>  d<i) 


(5) 


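where the unknown is the scattered electric field, and
the vector equation is written

$$\int_\Omega \left( \frac{1}{\epsilon_r} \nabla \times \mathbf{T} \cdot \nabla \times \mathbf{H} - k_0^2 \mu_r \mathbf{T} \cdot \mathbf{H} \right) d\Omega + \oint_\Gamma \left[ \alpha \mathbf{T} \cdot \mathbf{H}_{tan} - \beta (\hat{r} \cdot \nabla \times \mathbf{T})(\hat{r} \cdot \nabla \times \mathbf{H}) \right] d\Gamma$$

$$= \oint_\Gamma \left[ \alpha \mathbf{T} \cdot \mathbf{H}^i_{tan} - \beta (\hat{r} \cdot \nabla \times \mathbf{T})(\hat{r} \cdot \nabla \times \mathbf{H}^i) \right] d\Gamma - \oint_\Gamma \mathbf{T} \cdot \hat{r} \times \nabla \times \mathbf{H}^i \, d\Gamma \qquad (6)$$

where the unknown is the total magnetic field. In each
case the Bayliss-Turkel radiation condition has been applied
to satisfy the Neumann boundary condition on the
outer boundary.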

In order to obtain the final finite-element form, these
equations are discretized and presented as a linear system
of equations

Ku  =  b  (7) 

for  u  the  unknown. 


3  Nodal  Mapping 

The SIMD computer used is Thinking Machines Corporation’s
Connection Machine 2 (CM-2). Briefly, the
CM-2 is a SIMD (Single Instruction
Multiple Data), or data parallel, type of parallel computer.
This means that each computer instruction operates
on data stored in a processor array. Each processor
in the array holds a single data item. The CM-2 may be
configured to have up to 64k (k=1024) physical processors,
each with its own local memory. Computationally,
each physical processor may be subdivided into some
number of virtual processors, where the computational
resources of the physical processor are shared among its
virtual processors. The virtual processor ratio, then, is
the number of virtual processors assigned
to each physical processor, and it must equal an integer
power of 2. For a more complete description, see [3].

While SIMD computers have been in existence for
a number of years, finite element algorithms for them
have been few. One reason is that, as with all parallel
architectures, techniques which may be mature on serial
computers must be altered and sometimes discarded in
favor of more applicable algorithms. This paper introduces
a new nodal basis mapping of the finite element
algorithm onto the CM-2.

One difficulty with implementing finite element algorithms
on a SIMD computer is the choice of the data
item. To achieve a relatively high level of efficiency
as well as to limit communication, a data item which
may be maintained throughout the algorithm is desirable.
Typically, finite element algorithms operate on
an elemental level during the calculation of the system
of equations and then assemble these elemental equations
into a global set of equations which exist on the
nodal level. This global set of equations is then solved
to yield results at the nodal level. This may be seen
as having two different data items during different portions
of the program, and previous implementations of
this mapping on the CM-2 have proved inefficient [4],
[5]. To avoid this inefficiency, an algorithm which uses
a nodal level data set throughout the program has been
developed for use on the CM-2. While the solution on
the nodal level remains basically the same as in previous
finite element algorithms on the CM-2 [6], [7], the calculation
of the system of equations is done on the nodal
level using what has been termed nodal assembly.


3.1  Mesh  Generation 

The nodal-basis mapping assigns a node to a processor.
This mapping is maintained throughout the program,
from discretization through solution. During discretization,
each processor calculates its position in the problem
domain based on information which describes the
domain geometry. Each processor also determines its
boundary status. To enhance the speed of the program,
a parallel O-grid mesh generator is used to generate
meshes. The O-grid meshes allow the use of the nearest
neighbor (NEWS) communication grid, while the parallel
mesh generation means that only geometry data
need be specified on a front-end preprocessor.

A mesh is generated by the set of points formed by
the intersection of the lines of a boundary conforming
curvilinear coordinate system. The problem of interest
is a two-dimensional, multiply-connected, arbitrary
region with specified inner and outer boundaries. The
boundary values are specified in cartesian coordinates
(x,y) and are transformed to curvilinear coordinates
(s,t). In the transformed region, algebraic interpolation
is used to generate the physical cartesian coordinates
(x,y). See [8] for a complete description.


3.2  Nodal  Assembly 

The nodal assembly technique makes use of the concept
of a nodal region which contains a given node and its
neighboring nodes and elements as in Figure 2. Each
processor simply calculates the local interaction coefficients
associated with its row in the global system of
equations as well as the forcing value. Since the interactions
are local, nearest-neighbor communication
is used. This portion of the algorithm is somewhat
inefficient in applying boundary conditions since only
processors which represent boundary nodes are active
during this phase of the program. However, this may
only be slightly detrimental to the overall efficiency of
the program if the boundary-condition calculations are
not too laborious.


3.3  System  Solution 

Once calculated, the system of equations is solved using
a conjugate-gradient based algorithm [9]. Conjugate
gradient algorithms have been used previously on
the CM-2 for the solution of linear systems [6], [7].
This is because they are a collection of various matrix
and vector operations which can be performed with a
high level of concurrency. Further, in the case of a
regular grid, all the system coefficients represent local
interactions and so any interprocessor communication
will be nearest neighbor. Thus, communication is also
optimized using this solution technique. However, in
contrast with previous finite element algorithms on the
CM-2, the conjugate gradient algorithm used here is
one which must handle a complex-valued, non-positive
definite system of equations. It is given as


Figure 2: Nodal region of node “D” indicating nearest
neighbor nodes and adjacent elements.


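Initialize:

$$\mathbf{r}_0 = \mathbf{b} - K \mathbf{u}_0 \qquad (8)$$

$$\mathbf{p}_0 = K^T \mathbf{r}_0 \qquad (9)$$

Iterate:

$$a_i = \frac{\| K^T \mathbf{r}_i \|^2}{\| K \mathbf{p}_i \|^2} \qquad (10)$$

$$\mathbf{u}_{i+1} = \mathbf{u}_i + a_i \mathbf{p}_i \qquad (11)$$

$$\mathbf{r}_{i+1} = \mathbf{r}_i - a_i K \mathbf{p}_i \qquad (12)$$

$$b_i = \frac{\| K^T \mathbf{r}_{i+1} \|^2}{\| K^T \mathbf{r}_i \|^2} \qquad (13)$$

$$\mathbf{p}_{i+1} = K^T \mathbf{r}_{i+1} + b_i \mathbf{p}_i \qquad (14)$$

where the choice of $\mathbf{u}_0$ is arbitrary. Note here that the
matrix-transpose implies the conjugate-transpose.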
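
As an illustration of iteration (8)-(14), the following serial Python sketch applies it to a small dense complex system. It is not the CM-2 implementation: the dense list-of-rows storage, the function names, and the relative-residual stopping test are assumptions made for the example.

```python
def matvec(K, x):
    """Return K x for a dense matrix stored as a list of rows."""
    return [sum(k * xj for k, xj in zip(row, x)) for row in K]


def herm_matvec(K, x):
    """Return K^H x, the conjugate transpose of K applied to x."""
    return [sum(K[i][j].conjugate() * x[i] for i in range(len(K)))
            for j in range(len(K[0]))]


def norm2(x):
    """Squared Euclidean norm of a complex vector."""
    return sum(abs(v) ** 2 for v in x)


def cg_normal_equations(K, b, u0=None, tol=1e-10, max_iter=1000):
    """Conjugate gradient iteration following equations (8)-(14)."""
    u = list(u0) if u0 is not None else [0j] * len(K[0])
    r = [bi - ki for bi, ki in zip(b, matvec(K, u))]    # (8)
    z = herm_matvec(K, r)
    p = list(z)                                         # (9)
    zz = norm2(z)
    stop = tol ** 2 * norm2(b)   # assumed stopping test on the residual
    for _ in range(max_iter):
        if norm2(r) <= stop or zz == 0.0:
            break
        Kp = matvec(K, p)
        a = zz / norm2(Kp)                              # (10)
        u = [ui + a * pi for ui, pi in zip(u, p)]       # (11)
        r = [ri - a * kpi for ri, kpi in zip(r, Kp)]    # (12)
        z = herm_matvec(K, r)
        zz_new = norm2(z)
        beta = zz_new / zz                              # (13)
        p = [zi + beta * pi for zi, pi in zip(z, p)]    # (14)
        zz = zz_new
    return u


# Small complex, non-symmetric test system (values chosen for illustration).
K = [[2 + 1j, 1 + 0j], [0.5j, 3 - 1j]]
b = [1 + 0j, 2j]
u = cg_normal_equations(K, b)
residual = max(abs(bi - ki) for bi, ki in zip(b, matvec(K, u)))
print(u, residual)
```

Each iteration needs one product with K and one with its conjugate transpose, which on a regular grid involve only nearest-neighbor data.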

Figure 3 is a flow chart of the finite element program
as implemented on the CM-2.

Figure 3: Flow chart for CM-2 nodal-basis finite element
program.


4 Results

The method described above has been implemented on
the CM-2 using the C-Paris (PARallel Instruction Set)
programming protocol for the Connection Machine [10].
This program was used to obtain results for the solution
of electromagnetic wave scattering from a variety
of 2-dimensional objects. Table 1 gives some test problems
for scattering from perfect electric conducting objects.
The first 4 cases are all cylindrical shapes for
which a semi-analytical solution is available for accuracy
verification. The last case is an airfoil with NACA
number 0010. All floating point calculations are done
using 32 bit arithmetic and the floating-point acceleration
hardware available on the CM-2. The conjugate
gradient algorithm was halted when the following was
satisfied


<  10-^ 


(15) 


Figures 4-13 represent magnitude and phase plots
of the fields for the cases listed in Table 1. In each case,
the incident plane wave is taken as traveling in the
x-direction. The total field magnitude becomes zero on
the boundary and displays a shadow region behind the
conducting body. The phase plots illustrate that lines
of constant phase approach the perfect conducting inner
boundary at normal incidence. This follows from the
boundary condition that

$$E_{tan} = 0$$

on the boundary.

Figure 4: Case 1 magnitude: Total field magnitude for
scattering from a perfect electric conducting cylinder
with a = 3λ and b = 5λ

Table 2 gives timing and Megaflop ratings achieved
on the same problems. All timing results were obtained
using the CM timing facility. As Table 2 illustrates,
projected floating point computation rates of 200-400
MFlops have been achieved during both phases of the
algorithm. Further, the MFlop ratings extrapolated to
the same virtual processor ratio run on a full 64k CM-2
show capabilities on the order of 1.5 GFlops for both
portions. The finite element mapping described above
will allow the solution of problems in excess of 4 million
nodes on a fully configured CM-2 with the larger
256-kbit memory chips. Because of this capability, objects
of electrical sizes (dimension in terms of wavelengths)
exceeding 100 wavelengths may be studied using the finite
element method. This has not previously been possible.


Figure 5: Case 1 phase: Total field phase for scattering
from a perfect electric conducting cylinder with a = 3λ
and b = 5λ

Figure 6: Case 2 magnitude: Total field magnitude for
scattering from a perfect electric conducting cylinder
with a = 10λ and b = 14λ

Figure 7: Case 2 phase: Total field phase for scattering
from a perfect electric conducting cylinder with a = 10λ
and b = 14λ

Figure 8: Case 3 magnitude: Total field magnitude for
scattering from a perfect electric conducting cylinder
with a = 10λ and b = 14λ. Twice the nodal density of
Case 2 was used in both the radial and circumferential
directions

Figure 9: Case 3 phase: Total field phase for scattering
from a perfect electric conducting cylinder with a = 10λ
and b = 14λ. Twice the nodal density of Case 2 was used
in both the radial and circumferential directions

Figure 10: Case 4 magnitude: Total field magnitude for
scattering from a perfect electric conducting cylinder
with a = 30λ and b = 32λ

Figure 11: Case 4 phase: Total field phase for scattering
from a perfect electric conducting cylinder with a = 30λ
and b = 32λ

Figure 12: Airfoil magnitude: Total field magnitude
for scattering from a perfect electric conducting airfoil
with chord length = 5λ and b = 9.5λ. NACA number
is 0010

Figure 13: Airfoil phase: Total field phase for scattering
from a perfect electric conducting airfoil with chord
length = 5λ and b = 9.5λ. NACA number is 0010

Figure 14 shows a plot of the “relative speedup”
demonstrated by the Fill and Solve portions of the program.
This “relative speedup” is defined as

$$\text{relative speedup} = \frac{t_{minp}}{t_{np}}$$

where $t_{minp}$ is the execution time of a given problem
on the minimum number of processors possible (highest
virtual processor ratio) and $t_{np}$ is the execution time on
some number of processors. Note that the graph illustrates
the speedup over the largest possible range of
these ratios for this program using a CM-2 with processor
memories of 64k-bits.


Table 1: Test cases.

                        Num. Nodes                  V.P.
Case   a       b        Circ.   Radial   Total      Ratio
1      3λ      5λ        512      32      16384      1
2      10λ     14λ      1024      64      65536      4
3      10λ     14λ      2048     128     262144     16
4      30λ     32λ      2048      32      65536      4
5      5λ*     9.5λ     1024     128     131072      8

* the chord length of the airfoil was taken as 5λ.
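
The virtual processor ratios listed in Table 1 are consistent with every case having been run on 16k physical processors. That machine size is inferred here from the node counts and is not stated in the text, so it is an assumption. A small Python sketch of the bookkeeping, with illustrative names:

```python
PHYSICAL_PROCESSORS = 16 * 1024  # assumption inferred from Table 1


def vp_ratio(total_nodes, physical_processors):
    """Virtual processors per physical processor, one node per virtual processor."""
    ratio, remainder = divmod(total_nodes, physical_processors)
    # The ratio must be a positive integer power of 2.
    if remainder or ratio < 1 or ratio & (ratio - 1):
        raise ValueError("node count must be a power-of-two multiple "
                         "of the processor count")
    return ratio


# (case, circumferential nodes, radial nodes, V.P. ratio listed in Table 1)
TABLE_1 = [
    (1, 512, 32, 1),
    (2, 1024, 64, 4),
    (3, 2048, 128, 16),
    (4, 2048, 32, 4),
    (5, 1024, 128, 8),
]

for case, circ, radial, listed in TABLE_1:
    total = circ * radial
    print(case, total, vp_ratio(total, PHYSICAL_PROCESSORS), listed)
```

Because each finite element node occupies one virtual processor, the total node count divided by the number of physical processors gives the ratio directly.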


Table 2: Timings and Mflop ratings for the above cases.

        Time (s)                        MFlops
Case    Fill    Solve      Total        Fill    Solve
1       0.05      33.52     113.25      365     211
2       0.17     137.88     314.65      430     322
3       0.78    1811.82    1967.79      375     399
4       0.17     125.24     224.72      430     318
5       0.32       -       6798         457


Table 3: Mflop ratings extrapolated to the virtual processor
ratio implemented on a full 64k processor CM-2.

VP       Projected MFlops
Ratio    Fill     Solve
1        1461      846
4        1719     1289
16       1500     1596

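
The projected figures in Table 3 are close to four times the measured rates in Table 2 at the same virtual processor ratio. That is what a linear extrapolation from a 16k-processor run to a full 64k machine would give; both the linear scaling and the 16k run size are assumptions inferred from the tables, not statements from the text. A short Python check:

```python
# Measured (Fill, Solve) MFlops from Table 2 for cases 1, 2 and 3,
# keyed by the virtual processor ratio given in Table 1.
MEASURED = {1: (365, 211), 4: (430, 322), 16: (375, 399)}

# Projected (Fill, Solve) MFlops from Table 3.
PROJECTED = {1: (1461, 846), 4: (1719, 1289), 16: (1500, 1596)}

SCALE = 65536 / 16384  # assumed ratio of full machine size to run size


def extrapolate(mflops, scale=SCALE):
    """Linear extrapolation of a measured rate to the larger machine."""
    return mflops * scale


for ratio in sorted(MEASURED):
    estimate = tuple(extrapolate(m) for m in MEASURED[ratio])
    print(ratio, estimate, PROJECTED[ratio])
```

The small differences from the tabulated projections are consistent with rounding of the measured rates.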

Figure  14:  Relative  Speedup. 
5 Conclusions and Future Research

Using the nodal-assembly technique, a finite element
program is implemented on a data parallel computer
in a manner which allows the use of the same data
structure throughout the program, from discretization
through solution. Nodal-assembly mapping provides
for a relatively efficient program.

From the work presented here, several conclusions may be
drawn.

• Nodal assembly allows the mapping of one finite element
node onto one virtual processor. This mapping
is maintained throughout the program.

• Using first order quadrilaterals and a regular mesh,
the mapping may be configured in a NEWS grid,
allowing nearest-neighbor communication.

• The mapping described above will permit a maximum
virtual processor ratio of 16 under the current
CM-2 memory limitations (64k bits per processor).
On a 64k processor machine, this allows a
maximum of 1048576 nodes.

• Nodal assembly is inefficient when handling boundary
conditions. This is because only processors on
a given boundary are active during this portion of
a program.

• Nodal basis mapping is well suited for use with a
conjugate-gradient iterative solution. All the matrix
and vector operations can be computed with a
high level of concurrency. Nearest neighbor communication
is again used in performing the matrix-vector
products.

With respect to the CM-2, several things need be
said. First, although all these examples were run on
a machine with only a 32 bit floating point accelerator,
64 bit accelerators are now available to allow double
precision floating point calculations in hardware.
Second, the individual processor memory has been increased
from 64k bits to 256k bits on some machines.
This would effectively allow the solution of problems
with 4 times the number of nodes. The 64k bits of processor
memory allowed for a maximum virtual processor
ratio of 16, for 1048576 nodes. The new memory size
would allow a virtual processor ratio of 64, for 4194304
nodes.
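
The capacity figures follow from multiplying the number of physical processors by the virtual processor ratio. The sketch below also scales the ratio linearly with processor memory, which is the reasoning the text implies; the function names are illustrative, and the linear scaling is an assumption that ignores any fixed per-processor overhead.

```python
FULL_CM2_PROCESSORS = 64 * 1024  # fully configured CM-2


def max_nodes(vp_ratio, processors=FULL_CM2_PROCESSORS):
    """Largest mesh when each node occupies one virtual processor."""
    return vp_ratio * processors


def scaled_vp_ratio(base_ratio, base_memory_kbits, new_memory_kbits):
    """Virtual processor ratio after a memory upgrade, assuming linear scaling."""
    return base_ratio * new_memory_kbits // base_memory_kbits


ratio_64k = 16
ratio_256k = scaled_vp_ratio(ratio_64k, 64, 256)

print(ratio_64k, max_nodes(ratio_64k))
print(ratio_256k, max_nodes(ratio_256k))
```

With 64k bits per processor this gives 1048576 nodes, and with 256k bits it gives 4194304 nodes, matching the "in excess of 4 million nodes" quoted in the Results section.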

Further research into data parallel techniques and
their use in the solution of scattering problems is ongoing.
These include an extension to 3-dimensional finite
elements, effectiveness of the absorbing boundary
condition for both 2- and 3-dimensional problems and
convergence properties of the conjugate gradient algorithm
when applied to complex geometries.

Also, with respect to mesh generation, several areas
require further investigation. These include parallel
mesh generation, mesh refinement techniques as
well as other interpolation schemes. Mesh refinement
techniques allow the program to actively alter the node
distribution in the physical domain. This permits more
nodes to be allocated in regions where the solution is expected
to vary rapidly and fewer nodes in regions where
the solution is expected to be relatively constant. Thus,
a better approximation to the exact solution is obtained
for a given number of nodes.
Abstract

A 2D electromagnetic finite element analysis code which
runs on the JPL/Caltech Mark IIIfp Hypercube is being
upgraded to handle fully 3 dimensional scattering
problems. The EM code uses finite elements [1] to model
a finite problem domain which may include regions of
anisotropic or nonuniform dielectric properties. It solves
the single frequency source driven vector wave equation for
electric or magnetic fields in this domain. The code is
being implemented as a testbed for finite elements as
applied to EM problems, with several types of elements,
radiation boundary condition strategies, and parallel
solvers.

Introduction 

The general electromagnetic scattering problem is
computationally taxing for even the most powerful
present day computers. As depicted in Figure 1, the
problem may be represented as a set of scatterers contained
within a computational domain. The scatterers may be
electrically complicated objects, consisting of various
dielectric materials and conductors. The objects are
illuminated by a known incident field, and the scattered
electromagnetic radiation is to be determined. The
accurate geometric representation of each object is
necessary in order to correctly compute the field solution
for the interesting case, i.e., when the wavelength of the
incident radiation is comparable in size to the features of
the scatterers themselves. There are several methods
capable of solving the problem, each having different
requirements for storage, computation, and geometric
complexity. We have implemented a parallel EM analysis
code which uses the finite element technique [1] to model
the problem domain. Our choice of finite elements is
based upon the need to accurately model the scatterers,
while living with constraints of storage and cpu
performance. A mildly complicated 3D problem can
easily  require  the  solution  of  10^  simultaneous  equations. 


Fig.  1.  The  Canonical  Electromagnetic  Scattering 
Problem.  An  electric  or  magnetic  field  solution  is 
required  for  the  entire  problem  domain,  which  may 
include  several  scatterers  consisting  of  many 
materials. 

The  finite  element  method  decomposes  the  problem 
domain  into  a  set  of  contiguous  non-overlapping  elements 
of  various  shapes  which  can  support  a  piecewise 
continuous  function  with  various  degrees  of  smoothness. 
Linear  finite  elements  can  support  piecewise  linear 
functions,  while  higher  order  elements  can  support 
piecewise  quadratic  or  cubic  functions.  The  elements 
themselves  can  be  shaped  to  conform  to  the  problem 
geometry,  and  consist  of  a  set  of  nodes  at  which  are 
defined  local  orthonormal  basis  functions.  These  basis 
functions  are  zero  at  every  other  node,  and  have  non-zero 
value  only  within  the  elements  which  contain  the  given 
node. The problem solution is obtained by solving a set
of linear equations for the coefficients of these basis
functions, which usually represent the values of the
solution at the nodal points. This is much like a Fourier
decomposition of some function, except that the
orthogonal basis functions are local finite elements instead
of sines and cosines. This has an important advantage for
parallel processing: the basis functions interact only with
their nearest neighbors and only require local knowledge of
the problem domain. This is in sharp contrast to
a Fourier decomposition, where each basis
function must sample the entire problem domain. The
matrix  equation  which  results  from  a  finite  element  model 
is  also  extremely  sparse,  which  translates  into  the  ability 
to  meticulously  model  the  scatterer  geometry  with  an 
appropriate  density  of  elements,  while  using  a  more  coarse 
mesh  in  less  critical  regions. 
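
The locality described above can be seen in a very small example. The following Python sketch assembles the matrix for a one-dimensional mesh of 2-node linear elements. It is an illustration under simplifying assumptions (one dimension, a Laplace-type element matrix, dense storage) and is not taken from the EM analysis code; the name `assemble_1d` is hypothetical.

```python
def assemble_1d(num_elements):
    """Assemble a global matrix from 2-node linear elements on a 1D mesh."""
    num_nodes = num_elements + 1
    # Dense storage is used only so the sparsity pattern is easy to inspect.
    K = [[0.0] * num_nodes for _ in range(num_nodes)]
    # Element matrix for a unit-length linear element (illustrative choice).
    ke = [[1.0, -1.0], [-1.0, 1.0]]
    for e in range(num_elements):
        nodes = (e, e + 1)  # the two nodes belonging to element e
        for a in range(2):
            for b in range(2):
                K[nodes[a]][nodes[b]] += ke[a][b]
    return K


K = assemble_1d(5)
# Collect the positions of all non-zero entries.
nonzeros = [(i, j) for i, row in enumerate(K)
            for j, v in enumerate(row) if v != 0.0]
```

Every non-zero entry couples a node either to itself or to an immediate neighbor, so each row can be built from the elements that touch that node and nothing else.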

Code  Structure 

Parallelization of a finite element code is best achieved by
partitioning the problem domain among the processors.
The local nature of the finite element means that the
matrix of linear equations can be assembled entirely in
parallel, with each processor constructing the part of the
matrix which corresponds to its subset of elements. The
sparse matrix which results represents couplings among
nearest neighbors on the finite element grid. These
interactions can be visualized by looking at a sample
finite element grid. In Fig. 2, a sample 4-node
quadrilateral finite element grid has been partitioned (not
optimally) among four processors. Each equation couples
one node to its nearest neighbors through finite elements
which share that node, so in the grid depicted, a node
couples to at most 9 other nodes. For a node in the
interior of a subdomain (shown in black), the equation
which represents it involves entries which are local to its
parent processor. Those nodes on the subdomain surfaces
require information which must be obtained from 2 or
more processors. Thus a partitioning strategy for parallel
processing must seek to minimize the surface area of these
subdomains while dividing the problem domain into
pieces of approximately equal volume.

To satisfy these requirements, we use a Recursive Inertial
Partitioning Algorithm. The algorithm is designed for
hypercube topology, where the number of processors is a
power of 2. It proceeds as follows. First determine the
"center of mass" of the grid. Then determine the axis
through the center of mass with minimum moment of
inertia. Bisect the grid with a plane through the center of
mass which is perpendicular to this axis. Then repeat the
procedure on each subgrid until the number of pieces


Fig.  2.  A  sample  finite  element  domain 
decomposition.  Each  processor  has  a  mutually 
exclusive  subset  of  the  elements.  Nodes  internal  to 
the  subdomains  are  exclusive  to  the  parent  processor, 
while  nodes  on  the  surfaces  must  be  shared. 

corresponds  to  the  number  of  processors.  Refinements  to 
this  decomposition  can  be  made  using  simulated  annealing 
to  further  minimize  the  surface  areas  of  the  subgrids,  but 
the  technique  works  best  with  user  interaction.  In  most 
cases,  the  load  balance  and  communication  requirements 
which  result  by  just  doing  the  recursive  inertial 
partitioning  are  sufficient. 
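
The recursive inertial partitioning steps can be sketched in a few lines of Python. This is a minimal two-dimensional illustration that treats every node as a unit point mass; the function names are hypothetical and the sketch is not the partitioner used in the code described here. It assumes no piece becomes empty during recursion.

```python
import math


def inertial_bisect(points):
    """Split 2-D points by a line through the centroid, perpendicular to
    the axis of minimum moment of inertia."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    # Second moments about the centroid (unit mass per point).
    sxx = sum((p[0] - cx) ** 2 for p in points)
    syy = sum((p[1] - cy) ** 2 for p in points)
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points)
    # The axis of minimum moment of inertia is the direction of largest spread.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    low, high = [], []
    for p in points:
        projection = (p[0] - cx) * ux + (p[1] - cy) * uy
        (low if projection < 0.0 else high).append(p)
    return low, high


def recursive_inertial_partition(points, levels):
    """Return 2**levels pieces, one per processor of a hypercube."""
    if levels == 0:
        return [points]
    low, high = inertial_bisect(points)
    return (recursive_inertial_partition(low, levels - 1)
            + recursive_inertial_partition(high, levels - 1))


# A 16 x 4 grid of node coordinates split among 4 processors.
grid = [(float(i), float(j)) for i in range(16) for j in range(4)]
pieces = recursive_inertial_partition(grid, 2)
```

For this elongated grid each cut is made across the long direction, which keeps the cut (the subdomain surface) short while giving every processor the same number of nodes.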

The  EM  analysis  code  has  been  designed  as  a  testbed  for 
the  finite  element  technique.  To  that  end,  we  have 
implemented  several  types  of  finite  elements,  radiation 
boundary  conditions,  and  parallel  solvers.  The  number 
and sparsity of the set of linear equations immediately
suggest an iterative solver such as the Bi-Conjugate
Gradient method [2]. This method requires only the
computation  of  matrix-vector  products  and  dot  products, 
and  with  exact  arithmetic  will  compute  the  solution  in  n 
iterations,  where  n  is  the  rank  of  the  matrix.  In  practice,  a 
sufficiently  accurate  solution  can  often  be  obtained  in 
some  fraction  of  n  iterations.  In  our  parallel 
implementation  of  the  Bi-conjugate  gradient  algorithm, 
the  matrix  is  never  completely  assembled.  Each  processor 
computes  and  retains  matrix  entries  for  only  those  nodes 
in  its  subdomain.  Matrix  entries  corresponding  to 
subdomain  surface  nodes  are  only  partially  assembled. 
Each processor does a piece of the matrix-vector multiply
or dot product at each iteration step, with results being
globally assembled only for the surface nodes. Since the
computation required for a matrix-vector product or a dot
product scales like the volume of the subdomains, while
the communication scales like the surface area, this
algorithm  becomes  more  efficient  as  problem  size 
increases. 
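
The partially assembled matrix-vector product can be illustrated with a small Python sketch. Two "processors" are simulated inside one process, the mesh is one-dimensional, and the element matrix is an illustrative choice; in a real message-passing code the final summation would be an exchange between neighboring processors. All names are hypothetical.

```python
def local_matvec(elements, x):
    """Return one processor's contribution to y = K x as {node: value}."""
    ke = [[1.0, -1.0], [-1.0, 1.0]]  # illustrative 2-node element matrix
    y = {}
    for nodes in elements:
        for a in range(2):
            for b in range(2):
                y[nodes[a]] = y.get(nodes[a], 0.0) + ke[a][b] * x[nodes[b]]
    return y


def assemble_shared(contributions):
    """Sum per-processor results; only shared (surface) nodes get two terms."""
    y = {}
    for part in contributions:
        for node, value in part.items():
            y[node] = y.get(node, 0.0) + value
    return y


x = [0.0, 1.0, 4.0, 9.0, 16.0]      # one value per node
proc_a = [(0, 1), (1, 2)]            # elements owned by processor A
proc_b = [(2, 3), (3, 4)]            # elements owned by processor B
part_a = local_matvec(proc_a, x)
part_b = local_matvec(proc_b, x)
shared_nodes = set(part_a) & set(part_b)
y = assemble_shared([part_a, part_b])
```

Only node 2 lies on the subdomain surface, so it is the only entry that needs data from both processors; all other entries are complete after the local step.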

We are also investigating a hybrid bi-conjugate gradient /
Gaussian elimination algorithm. This method uses
Gaussian elimination to remove the interior nodes in favor
of the surface nodes. Bi-conjugate gradients are then used
to complete the solution. The Gaussian elimination step
can be done entirely without communication, and has the
advantage of producing a partially inverted matrix. It has
the disadvantage of requiring more storage, since sparsity
is lost during the Gaussian elimination step.
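
In block form, eliminating interior unknowns in favor of surface unknowns is commonly written as a Schur complement. The notation below is assumed for illustration and does not appear in the original text: subscript i denotes interior nodes, subscript s denotes surface nodes, K is the matrix, u the unknowns, and f the right-hand side.

```latex
\begin{pmatrix} K_{ii} & K_{is} \\ K_{si} & K_{ss} \end{pmatrix}
\begin{pmatrix} u_i \\ u_s \end{pmatrix}
=
\begin{pmatrix} f_i \\ f_s \end{pmatrix}
\quad\Longrightarrow\quad
\left( K_{ss} - K_{si} K_{ii}^{-1} K_{is} \right) u_s
  = f_s - K_{si} K_{ii}^{-1} f_i ,
\qquad
u_i = K_{ii}^{-1} \left( f_i - K_{is} u_s \right)
```

Because interior nodes of different subdomains do not couple to one another, K_{ii} is block diagonal with one block per processor, which is why the elimination needs no communication. The reduced surface matrix is generally much denser than the original, which accounts for the extra storage.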

Results 

The  Electromagnetic  Finite  Element  Code  (EMFEC) 
consists  of  four  major  sections.  During  the  input  / 
initialization  phase,  the  finite  element  model,  hypercube 
partitioning  information,  and  excitation  parameters  are 
read. The model is translated into a local node and
element numbering scheme, and the communication
routing is determined for the solver. In the current
implementation, every processor must read the entire
finite element model and partitioning information to
determine which nodes and elements it requires. Element
setup is done next. This consists of computing entries in
the sparse stiffness matrix and entries in the force vector
for all elements. The solver is the third phase of
computation. For the bi-conjugate gradient solver,
iterations  on  an  initial  guess  are  performed  until  a 
solution  with  residual  error  below  a  specified  tolerance  is 
obtained.  Finally,  the  solution  is  collected  from  all  of  the 
processors  and  written  to  a  file. 

Performance of each of these code sections, as well as the
entire code itself, is illustrated in Fig. 3. Here we plot
speedup for a fixed test problem, consisting of a 2D finite
element model of a dielectric cylinder with electric
permittivity ε = 2.56, and unit radius. The model contains
2304 quadrilateral elements, and a total of 9313 nodal
points. We define speedup s as

s  =  Tg/Tp 

where  Tg  is  the  execution  time  for  the  problem  on  one 
processor  and  Tp  is  the  execution  time  on  multiple 
processors.  The  element  setup  and  solver  sections  of  the 
code  show  almost  uniform  speedup  as  the  number  of 
processors  is  increased,  achieving  almost  75%  efficiency 
on  32  processors.  The  output  code  section  exhibits 
performance  saturation  almost  immediately,  due  to  the 
hardware bottleneck in transmitting data to the hypercube


Figure  3.  Speedup  versus  number  of  processors  for  a 
dielectric  cylinder  test  case.  The  floating  point 
intensive  code  sections  exhibit  uniform  speedup  with 
increasing  processor  numbers.  The  I/O  related  code 
sections  show  practically  no  speedup. 

host.  The  input  section  is  completely  flat,  since  each 
processor must read the entire input data set. Poor
performance of these two code sections has a devastating
impact on the overall code performance. The entire code
has  a  measured  speedup  of  only  4  on  32  processors. 

This  effect  is  illustrated  in  Fig.  4,  where  we  have  plotted 
the  percentage  of  execution  time  spent  in  each  major  code 
section  as  a  function  of  the  number  of  processors  in  use. 
Running  on  1  processor,  65%  of  the  execution  time  is 
spent  in  the  compute  phases  of  the  code  (element  setup 
and  solver)  while  the  remaining  35%  of  execution  time  is 
spent  in  reading  the  model  and  writing  the  solution.  For 
32  processors,  only  10%  of  the  time  is  now  spent  on 
computation.  The  code  has  become  I/O  bound.  Clearly 
any  performance  improvement  must  come  from 
improving  the  I/O  code  sections,  since  they  currently 
derive  no  benefit  from  the  parallel  architecture.  Part  of  the 
problem  is  algorithmic  in  nature.  The  input  phase  of  the 
code  requires  that  each  processor  look  at  the  entire  finite 
element  model.  The  poor  performance  of  the  output  phase 
is  either  operating  system  or  hardware  related,  however, 
since  the  code  does  not  impose  any  ordering  of  the  data 
being  written  to  the  output  file. 
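
The way a flat I/O phase comes to dominate can be illustrated with a simple Amdahl-type estimate. The Python sketch below is an idealized model, not a fit to the measurements: it assumes the I/O time does not change at all with the number of processors, and the function names are hypothetical.

```python
def overall_speedup(parallel_fraction, parallel_speedup):
    """Speedup when only `parallel_fraction` of the one-processor time is
    accelerated by `parallel_speedup` and the remainder stays constant."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)


def compute_share(parallel_fraction, parallel_speedup):
    """Fraction of the multi-processor run time spent in the accelerated part."""
    accelerated = parallel_fraction / parallel_speedup
    return accelerated / ((1.0 - parallel_fraction) + accelerated)


# 65% compute on one processor; 75% efficiency on 32 processors is a
# compute speedup of 24.
s = overall_speedup(0.65, 0.75 * 32)
share = compute_share(0.65, 0.75 * 32)
```

With these inputs the model gives an overall speedup of about 2.7 and a compute share of about 7%. These are the same order as the measured values quoted above, but exact agreement should not be expected from so simple a model.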

We  have  also  begun  work  on  a  3D  version  of  the  code. 
Preliminary  test  results  indicate  that  performance  for  each 
code  section  is  unchanged  for  the  larger  models  associated 
with  3D  finite  element  problems.  Overall  code  efficiency 

Figure 4. Percentage of execution time spent in each
code section versus number of processors. Executing
on 1 processor, 65% of the execution time is spent in
computation. On 32 processors, the code is
completely I/O bound.

is slightly improved, since the computation time for the
solver scales as n^, where n is the number of nodes in the
model, while the I/O time scales linearly with n. The
larger finite element models resulting from 3D objects
account for the overall improvement. Nonetheless, a
better system for reading the model is called for. We are
investigating the possibility of assigning node and
element data to processors in a card-dealing fashion, then
rearranging the data by using recursive inertial partitioning
in parallel. The absolute bottleneck of transmitting data
to or from the host computer cannot be avoided.
However,  by  not  requiring  every  processor  to  examine  the 
entire  model,  we  expect  that  we  can  obtain  some 
improvement  in  the  efficiency  of  the  input  phase  of  the 
code. 
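
A card-dealing distribution is simply a round-robin assignment. The short Python sketch below shows the idea for a list of element indices; the helper name `deal` is hypothetical, and the later parallel rearrangement step is not shown.

```python
def deal(items, num_processors):
    """Assign items to processors round-robin, like dealing cards."""
    hands = [[] for _ in range(num_processors)]
    for index, item in enumerate(items):
        hands[index % num_processors].append(item)
    return hands


hands = deal(list(range(10)), 4)
```

Each processor then holds roughly 1/P of the model without ever having examined the rest of it.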


Conclusions 

We  have  demonstrated  that  the  computational  phase  of  a 
finite  element  code  can  be  performed  efficiently  on  a 
concurrent computer like the Mark IIIfp Hypercube. The
I/O bottleneck in transferring the large datasets to and
from the host computer remains a problem, limiting the
overall efficiency of the code. We are working on ways of
mitigating the effect of this bottleneck, though hardware
constraints will ultimately prevent us from eliminating it.

Abstract

In conjunction with the development of a test bed for finite
element approaches for solving three-dimensional
electromagnetic scattering problems on the JPL/Caltech
Mark IIIfp Hypercube, we have developed guidelines for
choosing parameters which improve accuracy at the cost of
computational resources. We show results of a series of
numerical experiments through which we can now predict the
required mesh density and the radius of a given type of
absorbing boundary which are needed to obtain virtually any
desired degree of accuracy for a given finite element type.
This type of guideline set is important for three-dimensional
finite element computations, because the number of unknowns
scales as the cube of both the linear node density per
wavelength, and the radius of a spherical absorbing boundary.
We examine several finite element formulations, including
both node- and edge-based elements, and discuss their
relative merits for accuracy vs. computational resources,
ease of parallel implementation, and parallel efficiency.

Introduction 

The finite element method is well suited to electromagnetic
scattering problems in which the scattering object is not too
large, but of the most general linear sort. The method can
accurately model EM fields in domains containing conductors,
lossy and lossless dielectric and magnetic materials,
anisotropic materials, and extremely inhomogeneous materials.
The method is also well suited to parallel computation: with
elements spatially partitioned among the processors, the
filling of a distributed stiffness matrix and its iterative
solution require very little interprocessor communication.
The expanding capacity of parallel computers opens the vista
of accurate solution to ever larger problems.

However, the achievement of accurate solutions often depends
on the adequate choice of more than one resource-intensive
parameter. Open-region scattering requires truncation of the
domain with some sort of absorbing boundary condition. The
solution gains accuracy as the truncation is placed farther
from the object, requiring the solution to a larger system of
equations. This added cost will buy nothing if the accuracy
is near the limits implied by other factors.

Assuming sufficient care is taken in modeling the object and
performing numerical integrations, the remaining factors of
interest are the formulation of the finite element, and the
element spatial density. The latter is the most
resource-intensive. Ideally, one uses the most accurate
element formulation for the problem, then chooses the element
density and the size of the truncated domain such that the
accuracy limitations of each are balanced.

Finite  Element  Formulations 

Three types of electromagnetic scattering problems are
addressed by analysis codes which run on the JPL/Caltech
Mark IIIfp Hypercube. The simplest code solves the 2-D scalar
Helmholtz equation, using finite elements filling a circular
truncated domain. The numerical technique is that used by the
sequential code of [1]. The absorbing boundary condition used
is the second-order condition of [2]. Several finite element
formulations are supported, including linear and quadratic
elements, triangular and quadrilateral elements, and total
field and scattered field elements. Another code solves the
2-D vector Helmholtz equation, and is primarily a test-bed
for investigating various element types. The vector problem
admits several additional types of element formulation,
including analogues of the scalar problem finite element
varieties, but also element types implying different means of
representing the vector components. Node-based vector
elements solve for the two orthogonal vector components as
two independent unknowns. Edge-based elements solve for the
tangential component of the field along each element edge. To
date, the code supports node and edge elements. Even more
exotic element types are possible. The final code solves the
full 3-D vector Helmholtz equation, using analogues of the
2-D vector elements.

Parametric  Study 

We have performed a parametric study of the solution accuracy
for a canonical problem using the 2-D scalar code. The
problem studied is the scattering from a dielectric cylinder
(relative dielectric constant 2.56) with radius a and wave
number k such that ka = 1. This problem has a well-known
analytic solution, which we used to measure the accuracy of
the finite element solution. Through trial and error, we
found a combination of element formulation, element density,
and truncated domain size which resulted in extremely high
accuracy. We then systematically reduced the element density
and the domain size independently, to determine the accuracy
impact of each factor separately. The element type used is
the quadratic, isoparametric (i.e., curved) 9-node
(Lagrangian) quadrilateral modelling the scattered field.
This element produced more accurate results than the
quadratic triangle element and the total field elements, and
was a considerable improvement over any linear element. The
element density is 15.7 per wavelength, and the domain radius
r is such that kr = 4.2. The accuracy is such that one
relative error measure (the magnitude of the far-field error
at each angle divided by the maximum magnitude of the far
field) is < 10“^* This implies an error in the RCS of < 1/10 dB
over a 60 dB range. The parametric study results in curves of
the relative error vs. kr and kh, where h is the minimum
finite element side. Note that in this case k must be
interpreted as the wave number inside the material; for
example, high dielectric materials require a finer grid, by a
factor of the square root of the relative dielectric. Both
curves display power-law characteristics. The kh curve goes
as (kh)^, which matches the theoretical behavior of field RMS
error for quadratic elements. The kr curve goes approximately
as (kr)~^, but we know of no justification of this behavior
from numerical theory. We believe the kh behavior represents
a reliable rule for any size and composition of scattering
object, while the kr behavior may apply only to this problem.
To obtain a rough rule of thumb for size of the truncation
domain, a large scattering object was modeled: a perfectly
conducting circular cylinder with ka = 50. Good agreement
(within 2 dB) with the analytic RCS was obtained by using
kr = 62, while kr = 56 was not considered adequately accurate
(errors exceed 4 dB). At the level of accuracy represented by
the kr = 62 case, using more than 3 or 4 quadratic elements
per wavelength proves wasteful. We conjecture that with the
use of the second-order Bayliss-Turkel condition, using r
about 25% larger than the object half-diameter will generally
produce RCS curves with similarly high quality.
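
The two rules of thumb above (grid spacing set by the wavelength inside the material, and a truncation radius about 25% larger than the object) can be written as a small Python sketch. The function names and the assumption of a non-magnetic material are illustrative only.

```python
import math


def element_size(free_space_wavelength, relative_dielectric,
                 elements_per_wavelength):
    """Element side length, using the wavelength inside the material.

    Assumes a non-magnetic material, so the wavelength shrinks by the
    square root of the relative dielectric constant.
    """
    wavelength_in_material = free_space_wavelength / math.sqrt(relative_dielectric)
    return wavelength_in_material / elements_per_wavelength


def truncation_radius(object_half_diameter, margin=0.25):
    """Radius of the absorbing boundary, `margin` larger than the object."""
    return (1.0 + margin) * object_half_diameter


h = element_size(1.0, 4.0, 4)
r = truncation_radius(50.0)
```

For the large cylinder with ka = 50, the 25% rule gives kr = 62.5, close to the kr = 62 case reported above as giving good agreement.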

Vector Elements


Similar accuracy studies of vector elements are proceeding,
and results will be reported soon. A comparison has been made
between linear triangular edge elements and node-based
elements for the ka = 1, dielectric 2.56 circular cylinder,
using 30 elements per wavelength and a modified Sommerfeld
boundary condition at kr = 3. The node-based elements
produced a more accurate RCS. Both types of element are
convenient for parallel partitioning, with the level of
inconvenience depending on the form of model specification.
For mesh generators which produce a model in the form of an
ordered list of nodal coordinates and element nodes, the edge
elements require an extra computational step: compiling a
list of numbered edges owned by each element. With respect to
a given triangular mesh, the edge elements should imply a
slightly faster solution: a mesh node is shared by six
elements, while an edge is shared by only two, implying both
a sparser matrix and less communication for the edge
elements. However, this advantage may prove to be outweighed
by considerations of accuracy.
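
The extra step needed for edge elements, compiling a list of numbered edges owned by each element, can be sketched as follows. This is one simple way to do it in Python, assuming triangles are given as triples of node numbers; the function name is hypothetical.

```python
def number_edges(triangles):
    """Number the unique edges of a triangular mesh.

    Returns (edge_ids, element_edges): a map from a sorted node pair to
    its edge number, and the three edge numbers owned by each triangle.
    """
    edge_ids = {}
    element_edges = []
    for tri in triangles:
        local = []
        for k in range(3):
            a, b = tri[k], tri[(k + 1) % 3]
            key = (min(a, b), max(a, b))  # orientation-independent key
            if key not in edge_ids:
                edge_ids[key] = len(edge_ids)
            local.append(edge_ids[key])
        element_edges.append(local)
    return edge_ids, element_edges


# Two triangles sharing the edge between nodes 1 and 2.
edge_ids, element_edges = number_edges([(0, 1, 2), (1, 3, 2)])
```

The shared edge receives a single number that appears in both triangles, which is exactly the coupling that must be communicated when the two triangles live on different processors.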

Abstract

The  Fokker-Planck  package  FPPAC  [1,2],  which 
solves  the  complete  nonlinear  multispecies  Fokker- 
Planck  collision  operator  for  a  plasma  in  two- 
dimensional  velocity  space,  has  been  rewritten  for  the 
Connection  Machine  2.  This  has  involved  allocation 
of  variables  either  to  the  front  end  or  the  CM2, 
minimization  of  data  flow,  and  replacement  of  Cray- 
optimized  algorithms  with  ones  suitable  for  a 
massively  parallel  architecture.  Coding  has  been 
done  utilizing  Connection  Machine  Fortran. 
Calculations  have  been  carried  out  on  various 
Connection  Machines  throughout  the  country. 
Results  and  timings  on  these  machines  have  been 
compared  to  each  other  and  to  those  on  the  static 
memory  Cray-2  at  the  National  Magnetic  Fusion 
Energy  Computer  Center.  For  large  problem  size,  the 
Connection Machine 2 is found to be cost-efficient.


Introduction 

Over the past several decades there has been a tremendous
increase in computer power. Most of this increase has been
due to improvements in hardware. Electrical components have
now become so efficient, however, that more recently the
concentration has been on improving the architecture of the
computer. This has led to shared memory multiprocessor
supercomputers such as the Cray-2 and the Cray-YMP.

Although these multiprocessor devices have had a substantial
impact on high speed computing, it has been recognized that a
more cost-effective approach might be to link together very
large numbers of slower, cheaper processors, each with its
own local memory. Such massively parallel computers, although
not at this time general purpose, have performed quite
impressively in a number of problem areas and are being used
selectively in production mode. Many members of the computing
community are beginning to think in terms of their use as
general production machines in the not-too-distant future.

There are fundamentally two different types of massively
parallel architectures: single instruction, multiple data
(SIMD) and multiple instruction, multiple data (MIMD). In an
SIMD device all of the processors execute the same
instruction in lock-step fashion, whereas in an MIMD device
each processor may follow its own set of instructions.
Programming an SIMD machine is analogous to vectorizing (on a
Cray), whereas programming an MIMD machine amounts to true
multitasking. Quite naturally, SIMD machines are cheaper per
processor and inherently easier to program. However, they
generally demand greater parallelism in the algorithm and
offer less flexibility. The most prominent example of an SIMD
device is the Connection Machine 2 (CM2), the computer used
in this investigation. Manufacturers of MIMD machines include
INTEL, NCUBE, and BBN.

This study involves the massive parallelization of a code
known as FPPAC [1,2], which time-integrates the Fokker-Planck
collision operator in a plasma. Codes such as FPPAC are used
to simulate collisional phenomena in magnetically confined
plasmas, particularly in situations where the charged
particle distribution functions depart sufficiently from
Maxwellians. Such scenarios include the heating of a plasma
by radio-frequency waves or energetic neutral beams, and the
loss of particles from selected areas of velocity space [3].
The relevant equation is the Boltzmann equation with
Fokker-Planck collision terms [4], more commonly referred to
as the Fokker-Planck equation.

The problem is to solve a nonlinear partial differential
equation for the distribution function of each charged
species in the plasma in terms of six phase space variables
(plus time). However, certain symmetry and ordering
assumptions can often be made, allowing the dimensionality to
be reduced to four: two spatial and two velocity coordinates.
When the magnetic field is uniform or when the particle
bounce motion operates on a time scale sufficiently faster
than other phenomena of interest, one spatial variable may be
eliminated, reducing the dimensionality to three. And when
diffusion across flux surfaces is slow relative to velocity
space dynamics, the other spatial coordinate may also be
eliminated, leaving only two velocity coordinates.

* Work performed under the auspices of the U.S. Department of Energy by the Lawrence Livermore National Laboratory under
Contract No. W-7405-Eng-48.

0-8186-2113-3/90/0000/0426$01.00 © 1990 IEEE

FPPAC invokes the above assumptions and solves the complete
nonlinear multispecies Fokker-Planck collision operator for a
plasma in two-dimensional velocity space. The operator is
expressed in terms of spherical coordinates [speed (v) and
pitch angle (θ)] under the assumption of azimuthal symmetry.
Provision is made for additional physics contributions (e.g.,
sources and losses, radio-frequency heating, electric field
acceleration). The charged species, referred to as general
species, are assumed to be in the presence of an arbitrary
number of fixed Maxwellian species. The electrons may be
treated either as one of those Maxwellian species or as a
general species. Coulomb interactions among all charged
species are considered.


The  Fokker-Planck  Collision  Operator 

The  Fokker-Planck  collision  operator  may  be  expressed 
in  the  form 


where f_a is the distribution function of species a and Γ_a
is a constant [1]. The Rosenbluth potentials [4] g_a and
h_a are written as


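    g_a = \sum_b \left(\frac{Z_b}{Z_a}\right)^2 \ln\Lambda_{ab} \int f_b(\mathbf{v}')\,|\mathbf{v}-\mathbf{v}'|\,d\mathbf{v}'    (2)

    h_a = \sum_b \frac{m_a+m_b}{m_b} \left(\frac{Z_b}{Z_a}\right)^2 \ln\Lambda_{ab} \int f_b(\mathbf{v}')\,|\mathbf{v}-\mathbf{v}'|^{-1}\,d\mathbf{v}'    (3)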

Under  the  assumptions  stated  above,  Eq.  (1)  may  be 
written  in  the  form 


where the coefficients A_a, B_a, C_a, D_a, E_a, and F_a are
expressible as linear combinations of the Rosenbluth
potentials g_a and h_a and their various derivatives. The
quantities f_a, g_a, and h_a are then expanded in Legendre
polynomials P_l(cos θ), with the result that the coefficients
of the series for g_a and h_a may be written in terms of
moments of the coefficients of the series for the various
distribution functions. When the dust clears, the six
coefficients of Eq. (4) are expressed as linear combinations
of functionals of the type M_l(w), N_l(w), R_l(w), and
E_l(w), where w is a coefficient of a Legendre series for a
distribution function; those four functionals are given in
Eqs. (5)-(8) below:

    M_l(w)(v) = \int_v^\infty w(y)\, y^{(1-l)}\, dy    (5)

    N_l(w)(v) = \int_0^v w(y)\, y^{(2+l)}\, dy    (6)

    R_l(w)(v) = \int_v^\infty w(y)\, y^{(3-l)}\, dy    (7)

    E_l(w)(v) = \int_0^v w(y)\, y^{(4+l)}\, dy    (8)


Further  details  are  given  in  McCoy,  et  al.  [1]. 

Spatial  Representation 

A variably spaced finite-difference grid in v and θ is
employed. Numerical differentiation is carried out using
nearest neighbors. The boundary conditions are
expressed in an analogous manner. The resulting
scheme conserves particle density down to round-off.
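
As a generic illustration of nearest-neighbor differencing on a variably spaced grid, the Python sketch below evaluates a three-point first derivative at an interior node. It is a textbook formula shown for orientation only; it is not claimed to be the conservative differencing actually used in FPPAC, and the function name is hypothetical.

```python
def first_derivative(x, f, i):
    """Three-point first derivative at interior node i of a nonuniform grid."""
    h1 = x[i] - x[i - 1]      # spacing to the left neighbor
    h2 = x[i + 1] - x[i]      # spacing to the right neighbor
    return (-h2 / (h1 * (h1 + h2)) * f[i - 1]
            + (h2 - h1) / (h1 * h2) * f[i]
            + h1 / (h2 * (h1 + h2)) * f[i + 1])


x = [0.0, 0.5, 1.5, 3.0]            # unequal spacings
f = [xi * xi for xi in x]           # sample function f(x) = x**2
d = first_derivative(x, f, 2)       # derivative at x = 1.5
```

The formula is exact for quadratics, so the result here equals 2 * 1.5 = 3 to within round-off, even though the two neighboring intervals differ in length.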


Temporal  Discretization 

Equation (4) is time-integrated using either implicit
operator splitting, an alternating direction implicit (ADI)
method, or fully implicit differencing. The former two are
appropriate for time-dependent simulations, whereas the fully
implicit method (which has not yet been implemented for the
Connection Machine) is best suited for approaching steady
state.

Table 1. Comparison of Fortran 8X and Fortran 77


The  Connection  Machine  2 

The CM2 consists of up to 65536 single-bit processors,
each with 8 to 32 kbytes of random access memory.
The processors are stored 16 to a chip, each pair of
chips sharing Weitek floating point hardware. Altogether
there are up to 4096 chips arranged in a 12-dimensional
hypercube. Arithmetic is typically performed on 32-bit
data, although using 64 bits (double precision) is allowed.
Chips which have actual 64-bit arithmetic have recently
become available. An important feature of the CM2 is its
support for virtual processors; that is, if one wishes to
execute 64K operations in parallel on a 16K machine, one may
assign 4 virtual processors (each with one quarter the
memory) to an actual processor. The most general form of
communication among the processors is the router. Nearest
neighbor communications, however, can be handled much more
efficiently using a separate communications mechanism
called the NEWS grid.

The CM2 is not a stand-alone machine. It is typically
front-ended by a VAX, a SUN4, or a Symbolics. The
front end, which is a serial machine, stores scalars and
short arrays and provides instruction sequencing and
some I/O. One may program in any of three higher-level
languages (CM Fortran, C*, or *LISP); these are
extensions of Fortran, C, and Common LISP, respectively.
Alternatively, one can use the PARallel Instruction Set
(PARIS) for CM operations together with any of the above
three languages on the front end. (Or for that matter,
PARIS instructions may be embedded in the higher level CM
languages.) Of the various CM languages, CM Fortran is the
newest, and it requires the VAX front end.


CM  Fortran 

CM  Fortran  is  one  of  the  first  Fortran  implementations 
using  8X  constructs.  Of  particular  importance  are  the 
array  constructs,  which  are  designed  to  generate  parallel 
code.  For  example,  the  Fortran  8X  code  block  in 
Table  1  accomplishes  the  same  task  as  the  Fortran  77 
beneath it.


Fortran 8X

      WHERE (D .NE. 0.)
        A = EOSHIFT(B,1,1) + .5*EOSHIFT(C,2,-1)
      ELSEWHERE
        A = D
      ENDWHERE

Fortran 77

      DO 1 I=1,100
      DO 1 J=1,100
      IF (D(I,J) .NE. 0.) THEN
        A(I,J) = B(I+1,J) + .5*C(I,J-1)
      ELSE
        A(I,J) = D(I,J)
      ENDIF
    1 CONTINUE


On  the  Connection  Machine,  by  default,  corresponding 
elements  of  arrays  of  the  same  shape  are  stored  on  the 
same  processor.  The  interprocessor  communications  in 
the  example  above  are  taken  care  of  by  the  EOSHIFT 
intrinsic  (which  in  this  case  uses  high  speed  nearest 
neighbor communications), and the WHERE construct
(which  is  analogous  to  CVMGT  in  Cray  Fortran)  allows 
parallelization  of  the  loop.  The  more  compact  style 
makes  coding  easier  to  read  and  debug. 
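
For readers unfamiliar with the array constructs, the following Python sketch emulates what the Fortran 8X block in Table 1 computes. The helper `eoshift` follows the argument order used in the table (array, dimension, shift) and fills vacated positions with zero; both helpers are hypothetical stand-ins written for illustration, and the loops here run serially rather than in parallel.

```python
def eoshift(a, dim, shift, fill=0.0):
    """End-off shift of a 2-D list: result[i][j] = a[i+shift][j] for dim 1,
    a[i][j+shift] for dim 2, with `fill` where the source is out of range."""
    rows, cols = len(a), len(a[0])
    out = [[fill] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            si = i + shift if dim == 1 else i
            sj = j + shift if dim == 2 else j
            if 0 <= si < rows and 0 <= sj < cols:
                out[i][j] = a[si][sj]
    return out


def where_update(b, c, d):
    """Emulate WHERE (D .NE. 0.) A = shifted B + .5 * shifted C; ELSEWHERE A = D."""
    sb = eoshift(b, 1, 1)    # plays the role of B(I+1,J)
    sc = eoshift(c, 2, -1)   # plays the role of C(I,J-1)
    return [[sb[i][j] + 0.5 * sc[i][j] if d[i][j] != 0.0 else d[i][j]
             for j in range(len(d[0]))] for i in range(len(d))]


b = [[1.0, 2.0], [3.0, 4.0]]
c = [[10.0, 20.0], [30.0, 40.0]]
d = [[1.0, 0.0], [1.0, 1.0]]
a = where_update(b, c, d)
```

On a SIMD machine every (i, j) position would be handled by its own (virtual) processor, with the shifts supplying the neighbor values.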


Ctmverskm  of  FPPAC  to  the  Connection  Machine 

The conversion of FPPAC has involved a number of
steps, including the following:

(a)  Allocating  variables  either  to  the  front  end  or  the 
CM.  As  a  rule,  large  arrays  are  stored  on  the  CM 
and  everything  else  is  on  the  front  end.  To  save 
communications  costs,  a  number  of  1-D  arrays  are 
SPREAD into two dimensions [e.g., XZ(I,J)=X(J)].

(b)  Minimizing  data  flow  between  the  front  end  and 
the  CM.  Special  routines  can  be  used  to  transfer 
blocks  of  data. 

(c) Converting to 8X constructs. Treating the Cray as
home base for FPPAC, the use of .IF statements
(which are processed by the CTSS precompiler)
enables the mimicking of 8X constructs with
Fortran 77 counterparts side by side. The Cray version can
then be debugged prior to testing on the Connection
Machine.

(d)  Replacement  of  Cray-optimized  algorithms  with 
highly  parallel  ones.  The  two  principal  areas  in¬ 
volved  are  the  Fokker-Planck  coefficient  computa¬ 
tion  and  the  time  integration  procedure. 


Computation  of  Fokker-Planck  Coefficients 

The  computation  of  the  Fokker-Planck  coefficients  in¬ 
volves  the  following  steps: 


(a)  Computation  of  Legendre  projections  of  the  distri¬ 
bution  functions. 

(b) Computation of the moments [Eqs. (5)-(8)].

(c)  Computation  of  Legendre  projections  of  the  coeffi¬ 
cient  pieces. 

(d)  Summation  of  the  Legendre  series. 


Steps  (a)  and  (d)  (which  turn  out  to  be  the  most  time- 
consuming)  are  cast  in  terms  of  matrix  multiplication. 
Step  (c)  involves  linear  combinations  of  the  various 
moments.  A  typical  term  in  step  (b)  involves  comput¬ 
ing  a  set  of  quantities  of  the  form 


$$a_j = \int_{0}^{v_j} f(v)\,dv \qquad (9)$$


for j = 1, ..., J. Using the Cray methodology, this takes on
the order of J sequential (though partially vectorizable)
operations. On the CM, however, this procedure is recast
into one which takes on the order of log J parallel
steps, thus saving a lot of time. One computes the
numbers

$$P_j = \int_{v_{j-1}}^{v_j} f(v)\,dv \qquad (10)$$


and then combines them in an appropriate manner. This
type of procedure is called a “SCAN.” The total number
of arithmetic operations is greater, but due to the
inherent parallelism of the hardware, the number of actual
(parallel) steps is smaller. Furthermore, the communications
steps, which involve meshpoint indices differing
by a power of two, are carried out using either a small
number of nearest neighbor jumps or up to two “hops”
along the router (by virtue of the binary reflected Gray
code ordering of NEWS arrays).
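A minimal Python sketch of such a scan follows. It assumes that the quantities being combined are per-interval contributions and that the desired result is their running sum; the function names are hypothetical and the list comprehension stands in for one synchronous data-parallel step. The second part checks the hop count claimed for the Gray code ordering, under the assumption that mesh point i is held by the hypercube node whose address is the binary reflected Gray code of i.

```python
def parallel_scan(p):
    """Inclusive running sum computed in ceil(log2(J)) synchronous steps."""
    a = list(p)
    steps = 0
    stride = 1
    while stride < len(a):
        # Each element adds the value held `stride` places to its left.
        # On a data-parallel machine all J additions form a single step.
        a = [a[j] + a[j - stride] if j >= stride else a[j]
             for j in range(len(a))]
        stride *= 2
        steps += 1
    return a, steps


def gray(i):
    """Binary reflected Gray code of the non-negative integer i."""
    return i ^ (i >> 1)


def hops(i, k):
    """Hypercube hops between the nodes holding mesh points i and i + 2**k."""
    return bin(gray(i) ^ gray(i + 2 ** k)).count("1")


sums, n_steps = parallel_scan([1, 2, 3, 4, 5, 6, 7, 8])
```

For eight inputs the scan finishes in three steps rather than seven sequential additions, although it performs more additions in total. Neighboring mesh points (k = 0) are one hop apart, and points whose indices differ by any larger power of two are exactly two hops apart.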


Time-Integration  Procedure 

Either of the two schemes (implicit operator splitting or
ADI) involves the solution of parallel tridiagonal systems.
On the Cray, vectorization over the direction
orthogonal to the sweep is carried out. On the CM, a
procedure known as parallel cyclic reduction [5] is also
implemented. During each step of the ordinary cyclic
reduction  procedure,  both  odd  and  even  reductions  are 
carried  out  (in  parallel),  so  that  no  backfilling  is  neces¬ 
sary  once  the  reduction  procedure  is  complete.  As  with 
the  moments  computation,  this  takes  a  greater  number 
of  arithmetic  operations  but  a  smaller  number  of  paral¬ 
lel  steps. 
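The idea can be sketched for a single tridiagonal system in Python. This is an illustrative serial emulation, not the FPPAC routine; the name `pcr_solve` and the storage convention (sub-diagonal `a`, diagonal `b`, super-diagonal `c`, right-hand side `d`, with `a[0]` and `c[-1]` equal to zero) are assumptions. The inner loop over `i` has no dependencies between iterations, so it corresponds to one parallel step.

```python
def pcr_solve(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] by parallel cyclic reduction.

    Assumes a[0] == 0 and c[-1] == 0, and that no pivot b[i] becomes zero.
    """
    n = len(d)
    a, b, c, d = list(a), list(b), list(c), list(d)
    stride = 1
    while stride < n:
        na, nc = [0.0] * n, [0.0] * n
        nb, nd = list(b), list(d)
        for i in range(n):            # every equation is reduced in each step
            lo, hi = i - stride, i + stride
            if lo >= 0:               # eliminate the coupling to x[i - stride]
                alpha = -a[i] / b[lo]
                na[i] = alpha * a[lo]
                nb[i] += alpha * c[lo]
                nd[i] += alpha * d[lo]
            if hi < n:                # eliminate the coupling to x[i + stride]
                gamma = -c[i] / b[hi]
                nc[i] = gamma * c[hi]
                nb[i] += gamma * a[hi]
                nd[i] += gamma * d[hi]
        a, b, c, d = na, nb, nc, nd
        stride *= 2
    # All off-diagonal couplings now reach outside the system, so each
    # unknown follows directly without any back substitution.
    return [d[i] / b[i] for i in range(n)]


x = pcr_solve([0.0, -1.0, -1.0, -1.0],
              [2.0, 2.0, 2.0, 2.0],
              [-1.0, -1.0, -1.0, 0.0],
              [0.0, 0.0, 0.0, 5.0])
```

For n unknowns the loop runs about log2(n) times, each time touching all n equations, which is the trade of more arithmetic for fewer parallel steps described above.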


Connection Machine Facilities

The  debugging  of  the  Connection  Machine  version  of 
FPPAC  was  carried  out  primarily  at  the  Advanced 
Computing  Research  Facility  (ACRF)  at  Argonne  Na¬ 
tional  Laboratory.  Pre-released  features  of  the  Fortran 
compiler  were  tested  on  the  CMNS  computer  at  Think¬ 
ing  Machines  Corporation  (TMC).  Timing  compar¬ 
isons  were  carried  out  on  the  Connection  Machines  at 
the  National  Aeronautics  and  Space  Administration 
Ames Research Center (NASA Ames) and the Advanced
Computing Laboratory (ACL) at Los Alamos
National Laboratory, primarily the former. Timings
presented in this work are for the NASA Ames facility,
unless otherwise stated. An arrangement has just been
made  to  use  the  Connection  Machine  at  Florida  State 
University  (FSU).  A  comparison  of  the  various  Con¬ 
nection  Machines  is  given  in  Table  2. 


Table 2. Comparison of Connection Machine Facilities

Facility   No. of Processors   Hardware Upgrade   Vax Front End

ACRF       16K                 no                 8650
TMC        32K                 no                 6250
NASA       32K                 no                 6320
ACL        64K                 in progress        6420
FSU        64K                 yes                6420

Cray Facility

The  Connection  Machine  calculations  are  compared  to 
ones  utilizing  the  Cray-2  “F”  machine  at  the  National 
Magnetic  Fusion  Energy  Computer  Center  (NMFECC) 
at  Lawrence  Livermore  National  Laboratory  (LLNL). 
This  Cray  has  a  static  memory  of  128K  megawords  and 
like  all  others  at  NMFECC,  runs  the  Cray  Time  Sharing 
System  (CTSS).  The  Cray  Research  Incorporated 
CFT77  compiler  is  invoked. 


Coarse  Mesh  Calculation 

The first case considered has a velocity-space mesh
consisting of 128 points in v and 64 points in θ. The
Fokker-Planck  coefficients  are  expanded  in  a  5-term 
Legendre  series.  The  calculation  is  run  using  8K 
processors  and  32-bit  arithmetic.  (Note  that  64-bit 
arithmetic  is  mandatory  on  the  Cray-2.)  A  timing  com¬ 
parison  is  shown  in  Table  3.  All  times  are  given  in 
minutes.  The  headings  “CM-total”  and  “CM-active”  re¬ 
fer  to  the  total  elapsed  time  and  the  time  the  Connection 
Machine  (CM)  is  active,  respectively. 

It  can  be  seen  that  the  time  advancement  takes  8  times 
as  long  on  the  CM  as  on  the  Cray.  Assuming  a  full 
Cray  to  cost  four  times  as  much  as  a  full  CM,  this  trans¬ 
lates  to  comparable  cost  efficiency.  Note  that  the  vir¬ 
tual  processor  (VP)  ratio  for  the  CM  case  is  unity.  It 
may  also  be  observed  that  the  coefficient  computation 
takes  200  times  as  long  on  the  CM  as  on  the  Cray.  This 
is  due  to  the  fact  that  the  degree  of  parallelism  of  the 
calculation  is  at  most  640  and  to  the  fact  that  the  CM 
matrix  multiply  routine  (MATMUL)  operates  quite  in¬ 
efficiently  on  small  matrices.  Clearly,  the  Connection 
Machine  is  not  suitable  for  a  case  having  only  5  Legen¬ 
dre  polynomials. 


Table 3. Timing Comparison for Coarse Mesh Case

Procedure          Cray-2        CM-total      CM-active

Coefficients       6.5 x 10^-4   1.3 x 10^-1   7.6 x 10^-2
Time advancement   3.2 x 10^-4   2.5 x 10^-3   1.4 x 10^-3


Longer Legendre Series

The number of Legendre polynomials is now increased
from 5 to 64. This allows the coefficients calculation to
have a degree of parallelism equal to 8K, except perhaps
for the matrix multiply, in which typically a 128
by 64 matrix multiplies a 64 by 64 matrix. The increase
to 64 Legendre polynomials requires parts of the
calculation to be carried out in double precision. More
specifically, the computation of the moments requires
powers of v; the higher the Legendre polynomial, the
higher the power of v. Of critical importance is the
allowable exponent range available to represent the
powers of v, rather than the accuracy. A time comparison
is shown in Table 4.

It  can  be  seen  that  the  coefficients  computation  now 
takes  only  48  times  as  long  on  the  Connection  Machine 
as  on  the  Cray.  Virtually  all  of  the  time  (over  99%)  is 
spent  in  the  matrix  multiply.  The  double  precision  part 
of  the  calculation,  namely  the  time  it  takes  to  compute 
the  moments,  is  insignificant. 


Aside:  Exponent  Range 

It  is  important  for  users  of  the  Connection  Machine  to 
be  aware  of  an  inconsistency  in  double  precision  repre¬ 
sentation.  The  CM  double  precision  implementation 
follows  IEEE  standards  and  provides  for  an  1 1-bit  rep¬ 
resentation  of  the  exponent.  The  double  precision  rep¬ 
resentation  on  the  Vax  front  end,  however,  provides 
only  8  bits  for  the  exponent,  thereby  severely  limiting 
the  exponent  range  on  the  front  end.  One  must  exercise 
great  care  when  transferring  double  precision  data  be¬ 
tween  the  Connection  Machine  and  the  Vax  and  when 
doing  seemingly  “harmless”  double  precision  opera¬ 
tions.  It  is  of  further  interest  to  note  that  Cray  single 
precision  allocates  15  bits  for  the  exponent,  enabling 
the  Cray  to  represent  a  wider  range  of  numbers  but  at 
less  accuracy. 
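The practical consequence of the exponent width can be estimated with a short Python sketch. It uses an idealized model in which an exponent field of e bits allows magnitudes up to roughly 2^(2^(e-1)), ignoring reserved exponent values and other format-specific details; the function names and the sample value v = 10 are hypothetical.

```python
import math


def decimal_exponent_limit(exponent_bits):
    """Approximate largest decimal exponent for a binary exponent field of this width."""
    return (2 ** (exponent_bits - 1)) * math.log10(2.0)


def max_power(v, exponent_bits):
    """Largest integer n for which v**n stays below the approximate overflow limit (v > 1)."""
    return math.floor(decimal_exponent_limit(exponent_bits) / math.log10(v))


cm_limit = math.floor(decimal_exponent_limit(11))    # IEEE double on the CM
vax_limit = math.floor(decimal_exponent_limit(8))    # 8-bit exponent on the front end
```

Under these assumptions the CM representation reaches about 10^308 while the front end reaches only about 10^38, so a power such as v^64 with v = 10 is representable on the CM but overflows once it is transferred to the front end.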


Table 4. Timing Comparison for Coarse Mesh Case with Long Legendre Series

Procedure          Cray-2        CM-total      CM-active

Coefficients       6.7 x 10^-3   3.2 x 10^-1   2.3 x 10^-1
Time advancement   3.2 x 10^-4   2.5 x 10^-3   1.4 x 10^-3


Matrix Multiplication

The speed of the CM Fortran matrix multiply
(MATMUL) limits the performance of the Connection
Machine in the above case with 64 Legendre polynomials.
It is therefore of interest to compare matrix multiply
efficiency as a function of matrix size. Such a
comparison is shown for square matrices in Table 5.
The arithmetic uses 32 bits, and 16K processors are
utilized.

It can be seen that the CM2 matrix multiply performance
is very strongly dependent on the size of the matrix.
At order 1024, the CM (in single precision) outperforms
the Cray. However, at order 128, a case in
which the number of matrix elements equals the number
of processors, the CM operates 6 times more slowly.
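The quoted factors follow directly from the Mflops figures of Table 5, as the following short Python calculation shows. The dictionary names are arbitrary; the numbers are those of the table, with the order-2048 entry omitted because it did not fit on 16K processors.

```python
# Mflops by matrix order, taken from Table 5.
cray2 = {64: 431, 128: 438, 256: 446, 512: 452, 1024: 453}
cm2 = {64: 10, 128: 76, 256: 180, 512: 319, 1024: 577}

# A ratio above 1 means the Cray-2 is faster at that order.
slowdown = {n: cray2[n] / cm2[n] for n in cray2}

# Matrix elements per physical processor on a 16K-processor CM2.
elements_per_processor = {n: (n * n) / (16 * 1024) for n in cray2}
```

At order 128 there is exactly one matrix element per processor and the ratio is about 5.8, while at order 1024 each processor holds 64 elements and the ratio falls below one.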


Fine  Mesh  Calculation 

The mesh is now expanded to 512 points in v and 256
in θ. The Legendre series for computing the Fokker-Planck
coefficients has 64 terms (the upper limit for a
512-point  v-mesh  is  approximately  111).  The 
Connection  Machine  calculation  is  carried  out  with  16K 
processors.  Thus,  the  VP  ratio  of  the  time-advancement 
phase  is  8  (instead  of  1).  A  timing  comparison  is  shown 
in  Table  6. 
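The VP ratios quoted for the two meshes can be checked with a few lines of Python. The function `coefficient_parallelism` encodes an assumption, namely that the coefficient calculation offers one independent task per combination of v meshpoint and Legendre term; this reading reproduces the figures of 640 and 8K given earlier but is not stated explicitly in the text.

```python
K = 1024


def vp_ratio(n_v, n_theta, processors):
    """Mesh points (virtual processors) carried by each physical processor."""
    return (n_v * n_theta) / processors


def coefficient_parallelism(n_v, n_legendre):
    """Assumed degree of parallelism: one task per (v point, Legendre term)."""
    return n_v * n_legendre


coarse = vp_ratio(128, 64, 8 * K)     # coarse mesh on 8K processors
fine = vp_ratio(512, 256, 16 * K)     # fine mesh on 16K processors
```

The coarse mesh yields a ratio of 1 and the fine mesh a ratio of 8, so in the fine mesh case each physical processor has eight mesh points over which to amortize its instruction overhead.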


It  can  be  seen  that  the  time-advancement  phase  now  ex¬ 
ecutes  almost  as  fast  on  the  Connection  Machine  as  on 
the  Cray.  This  is  due  primarily  to  the  higher  VP  ratio. 
Assuming  only  32  bits  to  be  required,  this  translates  to  a 
factor  of  3  cost-effectiveness  in  favor  of  the  CM.  Fur¬ 
thermore,  the  coefficients  computation  executes  6  times 
faster  on  the  Connection  Machine,  due  both  to  the 
higher  VP  ratio  and  to  the  superior  performance  of 
MATMUL  for  larger  matrices.  In  fact  the  matrix  multi¬ 
plication  phase,  although  it  still  dominates,  now  is  re¬ 
sponsible  for  only  85%  of  the  coefficients  calculation. 
It may also be observed that the Connection Machine
experiences relatively less idle time; this is due to the
higher VP ratio.


Table 5. Matrix Multiply Performance

Order    Cray-2 Mflops    CM2 Mflops

64       431              10
128      438              76
256      446              180
512      452              319
1024     453              577
2048     454              too large for 16K


Table 6. Timing Comparison for Fine Mesh Case

Procedure          Cray-2        CM-total      CM-active

Coefficients       1.8 x 10^-1   3.0 x 10^-2   2.8 x 10^-2
Time advancement   7.3 x 10^-3   1.0 x 10^-2   1.0 x 10^-2

There  are  instances  where  the  precise  change  in  density 
will  strongly  affect  the  ensuing  physics.  In  such  cases 
64-bit  arithmetic  is  required.  A  comparison  of 
“standard”  and  double  precision  is  given  in  Table  7. 
Recall  that  the  “standard”  calculation  uses  single  preci¬ 
sion  almost  everywhere,  but  utilizes  double  precision 
when  computing  the  moments. 

Table 7. Standard Calculation versus Full Double Precision for Fine Mesh Case.
(The initial density and energy are 6.8 x 10^13 and 49.2228, respectively.)

(a) Procedure      CM-total (Std.)  CM-active (Std.)  CM-total (D.P.)  CM-active (D.P.)

Coefficients       3.0 x 10^-2      2.8 x 10^-2       5.8 x 10^0       5.8 x 10^0
Time advancement   1.0 x 10^-2      1.0 x 10^-2       9.2 x 10^-2      9.2 x 10^-2

(b) Precision      Final Density      Final Energy

Standard           7.65638 x 10^13    45.9036
Double Precision   7.64600 x 10^13    45.9139
Cray               7.64600 x 10^13    45.9139

It can be seen that the time advancement in double precision
takes about 9 times longer than the single precision
version. That is because the Connection Machines
used for these tests have available only a software double
precision implementation. It is intended to run tests
on a Connection Machine having 64-bit arithmetic as
soon as one becomes accessible. The time to compute
the coefficients increases by a factor of about 200, due
to both the software implementation of double precision
and the lack of a high speed double precision matrix
multiply package.

It  can  also  be  seen  that  the  double  precision  answers  are 
in  better  agreement  with  those  of  the  Cray,  which  is  to 
be  expected.  Since  this  test  case  is  run  for  only  ten 
timesteps,  very  little  inference  can  be  drawn  from  the 
difference  in  accuracy. 

Timing  on  the  Connection  Machine 

Timing  Fortran  programs  on  the  Connection  Machine  is 
more  of  an  art  than  a  science,  at  least  in  comparison  to 
timing  Cray  code  blocks.  That  is  because  the  Vax  real 
time  clock  has  a  very  low  resolution  (milliseconds)  and 
because  it  measures  the  time  consumed  by  all  pro¬ 
cesses,  not  just  the  one  in  question.  Furthermore,  the 
CM-active  time  is  computed  by  subtracting  the  CM-idle 
time  from  the  total  elapsed  time  as  measured  on  the 
front  end.  Hence,  the  active  time  is  no  more  accurate 
than  the  total  elapsed  time. 

TMC  advises  users,  when  performing  timing  tests,  to 
(a)  use  a  lightly  loaded  front  end  system,  (b)  time  code 
blocks  whose  duration  is  1  to  5  seconds,  and  (c)  run  the 
code  segment  at  least  5  times  and  use  the  minimum 
value  reported  [6]. 
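The third recommendation amounts to a simple timing harness, sketched here in Python rather than CM Fortran. The function name `min_time` is hypothetical. Taking the minimum over several runs discards measurements inflated by other processes sharing the machine.

```python
import time


def min_time(fn, repeats=5):
    """Run fn `repeats` times and return the smallest elapsed wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


calls = []
elapsed = min_time(lambda: calls.append(sum(range(10000))), repeats=5)
```

In keeping with recommendation (b), the timed block should be sized so that a single run lasts on the order of seconds, well above the resolution of the clock being used.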

Because  the  CM  is  in  essence  a  slave  of  the  front  end, 
its  overall  performance  will  vary  with  the  front  end 
model.  Generally  speaking,  the  systems  at  TMC, 
NASA  Ames,  and  ACL  give  roughly  comparable  per¬ 
formance;  the  Connection  Machine  at  ACRF  is  not  as 
robust. 


Toward  the  Future 

The version of FPPAC running on the Connection Machine
contains most of the important features of the
Cray version. Yet to be implemented are the capability to
treat multiple species and a fully implicit
finite-difference solver. The multispecies capability
was left out of this version because of compiler limitations
having to do with processor allocation for multiply
dimensioned arrays. These limitations have recently
been removed, however, and generalization to
multiple species should be simple and straightforward.
Provision of a fully implicit solver requires a parallel
routine to compute the nine-banded operator matrix and
a routine to solve that matrix (e.g., preconditioned conjugate
gradient). Such a procedure is likely to dominate
the calculation, so that the overall code performance
will strongly reflect that of the fully implicit solver.

FPPAC  was  chosen  for  conversion  because  it  is  simple 
and  because  it  is  representative  of  more  involved  plas¬ 
ma  Fokker-Planck  models.  Present  state-of-the-art 
Fokker-Planck  calculations  can  treat  two  spatial  dimen¬ 
sions  (one  of  them  averaged  over  the  particle  bounce 
motion)  as  well  as  two  velocity  dimensions  and  contain 
a whole host of other physics as well. Extension of
this work to more realistic scenarios is under
investigation.

Abstract

The  BBN  TC2000  is  a  multiple  instruction,  multi¬ 
ple  data  (MIMD)  machine  that  combines  a  physically 
distributed  memory  with  a  logically  shared  memory 
programming  environment  using  the  unique  Butter¬ 
fly  switch.  Particle-In-Cell  (PIC)  plasma  simulations 
model  the  interaction  of  charged  particles  with  elec¬ 
tric  and  magnetic  fields.  This  presentation  describes 
the  implementation  of  both  a  1-D  electrostatic  and  a 
2  1/2-D  electromagnetic  PIC  (particle-in-cell)  plasma 
simulation  code  on  a  BBN  TC2000.  Performance  is 
compared  to  implementations  of  the  same  code  on  the 
shared  memory  Sequent  Balance  and  distributed  mem¬ 
ory  Intel  iPSC  hypercube. 

Introduction 

In recent years the traditional model of single-processor,
sequential computer architecture has become associated with
the von Neumann bottleneck [13]. A single CPU issuing
sequential requests over a bus to memory, and the
memory responding to one request at a time, creates
the bottleneck. In response to this problem, designers,
seeking  alternatives  to  the  von  Neumann  architecture, 
have  developed  a  wide  variety  of  parallel  architectures 
and  interconnection  technologies.  The  BBN  TC2000  is 
a  multiple  instruction,  multiple  data  (MIMD)  machine 
that  combines  a  physically  distributed  memory  with  a 
logically shared memory programming environment using
the unique Butterfly switch. Processors are connected
through the Butterfly switch network. Data may
be local to a processor, or remote (i.e., fetched through
the switching network from another processor).

This presentation includes a discussion of the implementation
of both a 1-D and a 2 1/2-D PIC (particle-in-cell)
plasma simulation code on a BBN TC2000 at
Argonne  National  Laboratory’s  Advanced  Computing 
Research  Facility.  Performance  is  compared  to  imple¬ 
mentations  of  the  same  code  on  the  shared  memory  Se¬ 
quent  Balance  and  distributed  memory  Intel  iPSC  hy¬ 
percube. 

Architecture  Overview 

The  most  commonly  used  classification  scheme  in  par¬ 
allel  computing  is  that  of  Flynn,  which  is  based  on  the 
concepts  of  instruction  streams  and  data  streams. 

SISD  single  instruction,  single  data 

SIMD  single  instruction,  multiple  data 

MISD  multiple  instruction,  single  data 

MIMD  multiple  instruction,  multiple  data 

The category of MIMD architecture has been subdivided
into the categories of distributed and shared memory
[13]. In addition, both distributed and shared memory
may be further categorized by implementation:
actual physical (hardware) implementation or logical
(software) implementation [2]. The BBN TC2000 combines
hardware  (the  Butterfly  switch)  with  software  to  im¬ 
plement  a  logically  shared  memory  programming  envi¬ 
ronment  on  top  of  physically  distributed  memory.  The 
logically  shared  memory  model  provides  ease  of  pro¬ 
gramming,  while  the  physical  distribution  of  memory 
permits  expandability.  Physically  shared  memory  sys¬ 
tems  are  limited  by  a  single  bus  with  fixed  bandwidth 
interconnecting  processors  and  memory.  Splitting  mem¬ 
ory  into  physically  disparate  modules,  and  providing 
multiple  interconnection  paths  between  processors  and 
memory  modules,  allows  expandability,  at  least  up  to 
several  hundred  processors. 

The  nX  operating  system  provides  the  cluster  mech¬ 
anism  for  designating  a  number  of  function  boards  as  a 
computing  resource.  The  system  cluster  includes  all 
the function boards of a machine. A number of function
boards are allocated to a public cluster, used for
compiling, editing, etc., and to an I/O cluster containing
the nX master function board. The cluster concept
provides a flexible, multi-user environment.

Figure 1: Function Board Components

The  main  components  of  the  TC2000  architecture  are 
function  boards  (8  to  504),  connected  by  a  Butterfly 
switch.  A  function  board  must  include  a  switch  inter¬ 
face,  T-bus  and  TCS  (Test  and  Control  System)  slave. 
To  this  minimum  configuration  may  be  added  a  proces¬ 
sor,  4-32  Mbytes  of  memory  and/or  a  VMEbus  interface 
[3].  Processor  boards  are  based  on  the  Motorola  88000, 
which  consists  of  an  88100  RISC  CPU  and  two  88200 
cache/memory  management  units,  one  for  data  and  an¬ 
other  for  instructions.  Connecting  the  components  on 
the  function  board  is  the  transaction  bus,  or  T-bus,  a 
32  bit-wide  memory  bus  with  80  Mbytes/sec  peak  band¬ 
width.  Figure  1  illustrates  the  major  components  of  a 
function  board. 

Each  processor  executes  an  independent  sequence  of 
instructions,  referencing  data  as  needed.  Virtual  ad¬ 
dresses  are  translated  by  the  memory  management  unit 
into physical addresses, which are in turn translated by
the  CPU  interface  into  System  Physical  addresses.  The 
interprocessor  network  allows  each  processor  to  share 
some  or  all  of  the  system  memory.  Memory  that  is 
physically  local  to  Processor  #1  is  considered  remote 
by  Processor  #2,  and  vice  versa.  Code,  constants,  and 
stack  variables  are  stored  in  local  memory,  not  fetched 
across  the  network.  The  application  data  is  usually 
spread  across  the  memory  of  the  machine  by  the  Uni¬ 
form  System,  but  may  be  placed  more  explicitly  by  the 
programmer. 

The unique component of the TC2000 architecture is
the Butterfly switch. The Butterfly switch implements
a packet switching network of 8-bit wide switch paths,
with a bandwidth of 38 Mbytes/sec. Each processor
is connected to the switch through an interface with
two ports. One port is used to access other function
boards, and the other is used to service requests from
other boards. A remote memory access is one made
over the switch; a local memory access is one in which
a processor accesses its own function board directly.
Multiple memory accesses in parallel are supported by
the bidirectional switch paths. The route a message takes
over the network is determined by the first 9 bits of its
physical address. The number of stages in the switching
network is determined by the number of ports to be
supported (e.g., a 2-stage switch supports 64 ports and a
3-stage switch supports 512 ports).
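The relation between stages, ports, and routing bits can be illustrated with a small Python sketch. It rests on an assumption not stated in the text: that each stage is built from 8-way switching elements and consumes three of the nine routing bits, most significant bits first. This is consistent with the port counts quoted (8^2 = 64, 8^3 = 512) but the actual bit ordering may differ. The function names are hypothetical.

```python
def ports_supported(stages, radix=8):
    """Number of ports reachable through `stages` stages of radix-way switch elements."""
    return radix ** stages


def route(dest, stages=3, bits_per_stage=3):
    """Output port selected at each stage for destination `dest`.

    Assumes the most significant group of routing bits is consumed first.
    """
    mask = (1 << bits_per_stage) - 1
    return [(dest >> (bits_per_stage * (stages - 1 - s))) & mask
            for s in range(stages)]


path = route(0b101011110)   # a 9-bit destination field
```

With three stages the nine routing bits are split into three groups of three, each group selecting one of eight outputs at its stage.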

A  message  encountering  contention  within  the  switch 
backs  out,  releasing  resources  until  it  has  returned  to 
the  requestor.  A  rejected  message  is  retransmitted  ac¬ 
cording  to  a  backoff  algorithm  [3].  After  a  certain  num¬ 
ber  of  rejections  and  retransmissions,  the  priority  of 
a  message  is  promoted  to  that  of  an  express  message, 
which  will  then  be  successfully  delivered  to  its  destina¬ 
tion.  Another  method  for  controlling  switch  contention 
is  a  connection  time  limit  imposed  on  each  path.  In  ad¬ 
dition  to  software  controls,  some  configurations  of  the 
TC2000  switch  provide  alternate  paths.  When  a  con¬ 
flict  occurs,  the  message  returns  to  its  source  node  and 
is  retransmitted  on  an  alternate  path. 

This  strategy  of  maintaining  redundant  paths  pre¬ 
vents  the  message  from  remaining  inside  the  switch  for 
a  long  time  and  potentially  conflicting  with  other  in¬ 
coming  traffic.  In  one  survey,  Larrabee,  Pennick  and 
Stern  calculate  switch  contention  overhead  to  be  one  to 
five  per  cent  of  total  run  time,  although  it  is  application 
dependent.  They  determined  that  message  time  is  nor¬ 
mally  dominated  by  the  time  required  for  the  message 
to  pass  through  the  switch  serially,  not  by  contention 
for switching paths [1].

Similar results have been reported by various researchers
at the University of Rochester on an earlier
system, the GP1000. LeBlanc reported switch contention
of 2% and memory contention of 3%. Exper¬
iments  using  four  times  as  many  memories  as  proces¬ 
sors  produced  performance  increases  of  30%  [9].  Mellor- 
Crummey  also  noted  that  increasing  data  locality  can 
significantly  improve  performance  [12]. 

PIC  Codes 

Particle-In-Cell (PIC) plasma simulations model the interaction
of charged particles with electric and magnetic
fields.  Research  problems  in  three  dimensions  often  in¬ 
volve  tens  to  hundreds  of  thousands  of  particles,  re¬ 
quiring  hours  of  CPU  time  on  vector  supercomputers. 
Problem  size  alone  has  motivated  the  search  for  an  effi¬ 
cient,  multi-processing  solution.  Previous  work  on  PIC 
plasma simulation codes on advanced architecture com¬
puters  has  been  surveyed  by  Walker  [14].  Development 
of  an  efficient,  multi-processing  solution  has  been  slowed 
by  the  inhomogeneous  nature  of  the  problem. 

The interaction of particles, which may move
throughout the entire simulation space in a non-uniform
manner, with field quantities maintained on a fixed spatial
grid, creates the conflict in problem distribution.
Each particle is defined by its position in space, velocity,
mass and charge density. Electric and magnetic field
quantities  are  discretized  to  each  grid  point.  Each  cycle 
in  a  PIC  simulation  consists  of  four  main  steps: 

Assignment Phase: Charge and current density for
each particle, at a given position and velocity, are
collected at each grid point, based on a weighting
algorithm.

Field Solve Phase: The electric and magnetic field
equations are solved at each grid point.

Interpolation Phase: Field quantities are interpolated
to each particle’s position, again based on
a weighting algorithm.

Particle Push Phase: Forces on the particles are
found using the electric and magnetic fields in the
Newton-Lorentz equation of motion, and used to
determine the particle’s new position and velocity.
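The assignment, interpolation, and push phases can be sketched for a 1-D electrostatic case in Python. This is a minimal illustration under several assumptions that are not taken from the text: linear (two-point) weighting, a periodic grid, particle positions in the range [0, L), no magnetic force, and a simple explicit velocity and position update. All function names are hypothetical, and the field solve phase is left out.

```python
def weights(xp, dx, ng):
    """Left and right grid indices and the right-hand weight for position xp >= 0."""
    s = xp / dx
    j = int(s)
    return j % ng, (j + 1) % ng, s - j


def assign_charge(x, q, ng, dx):
    """Assignment phase: deposit each particle's charge on its two nearest grid points."""
    rho = [0.0] * ng
    for xp in x:
        j, j1, w = weights(xp, dx, ng)
        rho[j] += q * (1.0 - w) / dx
        rho[j1] += q * w / dx
    return rho


def interpolate(e, xp, dx):
    """Interpolation phase: field at the particle, using the same weights."""
    j, j1, w = weights(xp, dx, len(e))
    return (1.0 - w) * e[j] + w * e[j1]


def push(x, v, e, qm, dt, dx):
    """Particle push phase: update velocity from the field, then position (in place)."""
    length = dx * len(e)
    for p in range(len(x)):
        v[p] += qm * interpolate(e, x[p], dx) * dt
        x[p] = (x[p] + v[p] * dt) % length


rho = assign_charge([0.25], 1.0, 4, 1.0)
x, v = [0.25], [0.0]
push(x, v, [0.0, 4.0, 0.0, 0.0], qm=1.0, dt=0.5, dx=1.0)
```

Using the same weights for assignment and interpolation keeps the two phases consistent, and the deposited charge sums to the particle charge.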

Previous research has focused on problem decomposition.
Lubeck and Faber [11] discuss implementation of
a 2-D, electrostatic code with static decomposition of
both particles and fields, on a hypercube. Problems
with this approach include the communication time
needed to transfer information between the divided grid
of the field calculations and the replicated grid of the
particle push phase. An early solution by Walker [15]
was based on static decomposition with quasi-static,
global communication routines. A problem with this
solution is the large amount of memory required for
the communication tables. Liewer, Decyk, Dawson, and
Fox solve the load balance problem by using separate
decompositions for particles and field quantities [10].
Using two distinct spatial decompositions requires global
redistribution of data twice during each time step.

For this study, a 1-D electrostatic code and a 2 1/2-D
electromagnetic code were developed based on the
widely used 1-D teaching code ES1 [5]. In the 1-D version,
the original FFT Poisson solver was replaced with
an iterative method, successive over-relaxation (SOR),
to treat more general boundary conditions. Static
decomposition with grid replication is used in both test
codes to facilitate implementation on both distributed
memory and shared memory architectures.
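A minimal Python sketch of an SOR solver for the 1-D Poisson equation is given here for illustration. It assumes normalized units in which the equation reads d²φ/dx² = -ρ, fixed (Dirichlet) potentials at both ends, and a relaxation factor of 1.5; the function name, the tolerance, and these parameter values are hypothetical and are not taken from the codes described.

```python
def sor_poisson(rho, dx, phi_left=0.0, phi_right=0.0,
                omega=1.5, tol=1e-10, max_iter=10000):
    """Solve phi'' = -rho on a uniform grid with fixed end values by SOR."""
    n = len(rho)
    phi = [0.0] * n
    phi[0], phi[-1] = phi_left, phi_right
    for _ in range(max_iter):
        biggest_change = 0.0
        for j in range(1, n - 1):
            # Gauss-Seidel value from the three-point stencil ...
            gs = 0.5 * (phi[j - 1] + phi[j + 1] + dx * dx * rho[j])
            # ... over-relaxed by the factor omega (0 < omega < 2).
            new = (1.0 - omega) * phi[j] + omega * gs
            biggest_change = max(biggest_change, abs(new - phi[j]))
            phi[j] = new
        if biggest_change < tol:
            break
    return phi


# Uniform unit charge density on [0, 1]; the exact solution is x(1 - x)/2.
phi = sor_poisson([1.0] * 11, 0.1)
```

Unlike an FFT solver, which presupposes periodic boundaries, the end values here can be set freely, which is the flexibility the switch to SOR was intended to provide.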


Initialization
    For each particle, assign spatial coordinates,
    velocity, and contribution to overall
    charge density. Electric (and magnetic)
    fields are initialized on the grid.

Body of Simulation

    particle routine
        Calculate new spatial coordinates
        and velocity for each particle, based
        on current coordinates, velocity,
        field quantities and charge densities.
        Calculate new charge densities at
        each grid point based on the charge
        contributed by each particle.

    field solver
        Update the electric (and magnetic)
        fields at each grid point.

Output Final Results


Figure 2: PIC Algorithm

Implementation 

Implementation  of  the  1-D  and  2  1/2-D  PIC  codes  was 
nearly  identical.  Both  programs  follow  the  basic  al¬ 
gorithm  shown  in  Figure  2.  The  main  difference  is  in 
the  solution  of  equations  for  the  electric  and  magnetic 
fields.  Poisson’s  equation  is  solved  by  successive  over¬ 
relaxation  (SOR)  in  the  1-D  version.  The  2  1/2-D  ver¬ 
sion  implements  a  time  dependent  solution  to  the  full 
set  of  Maxwell’s  equations  for  electromagnetic  fields. 

The  primary  data  structures  in  this  implementation 
are  shown  in  Table  1.  For  each  particle,  position  and  ve¬ 
locity  components  are  stored  in  the  particle  data  struc¬ 
ture.  For  each  grid  point,  electric  and  magnetic  field 
components  are  stored  in  the  field  data  structure.  In 
addition,  the  grid  data  structure  stores  momentum,  ki¬ 
netic  energy  and  charge  density  for  each  grid  point. 
Data  structures  in  this  implementation  consist  of  large 
(tens of thousands of elements), multi-dimensional arrays,
most of which are stored in (private) common
blocks.

Figure  3  illustrates  the  interaction  of  the  data  struc¬ 
tures  in  the  algorithm.  Current  particle  data  and  field 
data  are  used  to  calculate  new  particle  data  in  the  parti¬ 
cle  routine.  Grid  data  is  generated  for  the  new  particle 
data,  based  on  a  weighting  algorithm  for  each  particle’s 
contribution  to  the  grid.  Current  field  data  is  combined 
with  the  new  grid  data  to  calculate  new  values  for  the 
field  data  components. 
Table 1: Primary Data Structures

particle data (for each particle)
    1-D        (x, v_x, v_y)               position and velocity
    2 1/2-D    ((x, y), v_x, v_y, v_z)     position and velocity

field data (for each grid point)
    1-D        (e_x, b_z)                  electric and magnetic fields
    2 1/2-D    (e_x, e_y, e_z)             electric fields
               (b_x, b_y, b_z)             magnetic fields

grid data (for each grid point)
               (p, ke, ρ)                  momentum, kinetic energy and
                                           charge density

      SUBROUTINE particle
      INTEGER lockvar

      CALL shareblk (%loc(field), fieldsize)

      PARALLEL REGION, REPLICATE (...)
         LOCAL xloc, vxloc, momentum, energy, rho

         CALL load (pid, xloc, vxloc)

         calculate new xloc, vxloc, momentum, energy, rho

         CALL store (pid, xloc, vxloc)

         CALL uslock (lockvar, 0)
         CALL update_memory (momentum)
         CALL usunlock (lockvar)

         CALL uslock (lockvar, 0)
         CALL update_memory (energy)
         CALL usunlock (lockvar)

         CALL uslock (lockvar, 0)
         CALL update_memory (rho)
         CALL usunlock (lockvar)

      END PARALLEL
      END


Figure  3:  Data  Structure  Usage 

In  the  original  algorithm,  the  particle  routine  ac¬ 
counts  for  more  than  90%  of  the  computation  time. 
Therefore,  parallelization  was  limited  to  that  routine. 
Load  balancing  is  attained  by  the  assignment  of  equal 
numbers  of  particles  to  each  processor.  The  set  of  par¬ 
ticles  assigned  to  a  processor  may  be  located  anywhere 
in  the  grid.  Electric  and  magnetic  field  information  for 
the  entire  grid  must,  therefore,  be  made  available  to  all 
processors. 

Temporary  storage  on  each  processor  is  used  to  local¬ 
ize  data  references.  At  the  beginning  of  each  time  step, 
particle  position  and  velocity  are  copied  to  local  stor¬ 
age  on  each  processor  for  fast  access.  Updated  values 
are  written  to  the  global  data  structures  at  the  end  of 
the  routine  to  be  available  for  the  next  time  step.  In 
addition, the largest data structures are distributed in
memory over all available function boards by the scatter
mechanism. Distributing data is done to reduce
memory  contention,  or  hot  spots,  created  when  multi¬ 
ple  processors  are  trying  to  access  memory  on  a  single 
function  board  concurrently. 

Figure 4 details the steps in the particle algorithm.
Fortran language extensions provide the constructs used
for parallel programming and memory management.
Synchronization, and additional processor and memory
management are provided by the Uniform System library
routines. The PARALLEL REGION encloses a
block of code which is executed once by each available
processor. The REPLICATE option copies simple variables
(not arrays) to each processor. The LOCAL declaration
is used to create variables that are private to
the PARALLEL REGION.

Figure 4: Particle Algorithm

The particle routine begins by getting current values
for particle and field data. Electric and magnetic
field  information  is  copied  to  each  processor  in  a  sin¬ 
gle  step  using  the  shareblk  mechanism.  Particle  data 
is  loaded  into  local  storage  on  each  processor.  No  syn¬ 
chronization  is  needed  because  each  processor  is  loading 
a  unique  block  of  data  (the  block  is  identified  by  the  pid, 
processor-id,  variable).  Local  data  storage  is  used  for 
the calculation of momentum, kinetic energy, and current
density components. The partial results from each
processor are summed into the global storage for each
component at the end of the particle phase. Uslock and
usunlock  routines  enforce  critical  regions  that  prevent 
data  corruption  due  to  race  conditions. 
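
The accumulation step described above can be sketched in Python. This is only an illustration of the pattern (private partial results, then a lock-protected sum into shared storage), not the TC2000 code: Python threads stand in for processors, `threading.Lock` stands in for the uslock/usunlock pair, and the function name, the unit charge per particle, and the cell assignment rule are all assumptions made for the example.

```python
import threading


def particle_phase(num_workers, particles_per_worker, num_cells):
    """Each worker builds a private partial result, then adds it to the
    shared grid array inside a critical region."""
    global_rho = [0.0] * num_cells   # shared grid quantity (stands in for rho)
    lock = threading.Lock()          # plays the role of lockvar

    def worker(pid):
        # Private storage, analogous to a LOCAL variable in the parallel region.
        local_rho = [0.0] * num_cells
        for k in range(particles_per_worker):
            # Hypothetical rule placing each particle in a grid cell.
            cell = (pid * particles_per_worker + k) % num_cells
            local_rho[cell] += 1.0   # unit charge, for illustration only
        # Critical region: only one worker updates the shared array at a time.
        with lock:
            for i in range(num_cells):
                global_rho[i] += local_rho[i]

    threads = [threading.Thread(target=worker, args=(pid,))
               for pid in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return global_rho


if __name__ == "__main__":
    print(particle_phase(4, 100, 8))
```

Because each worker touches the shared array only once, inside the lock, the cost of synchronization grows with the number of workers rather than with the number of particles.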


Results 

The original implementation for the TC2000 using parallel
FORTRAN extensions produced less than half the
expected relative speedup. Additional work to optimize
both processor and memory management improved
speedup for a small number of processors, but
did not extend successfully to larger numbers of processors.
Memory management was improved by using
Uniform System routines to initialize the configuration
parameters and allocate space. Processor performance
was improved by reducing overhead and increasing
granularity.

Figure 5: Steps/Second for the 1-D PIC Code

The  shareblk  mechanism  is  an  efficient  way  to  copy 
read-only  data  to  all  processors.  Time  required  to  copy 
the  large  field  data  structure  in  a  two-processor  test  was 
25%  of  the  total  execution  time.  An  experimental  ver¬ 
sion  eliminated  the  shareblk  copy  by  using  a  shared, 
global  data  structure.  This  version  increased  execution 
time  in  a  two-processor  test  by  50%  and  was  not  con¬ 
sidered  further. 

Another  major  execution  cost  was  the  use  of 
lock/unlock  routines  for  synchronization.  An  experi¬ 
mental  version  with  no  lock/unlock  routines  executed 
10-25%  faster  for  two  to  eight  processors  than  the  ver¬ 
sion  with  locks.  However,  to  ensure  program  correct¬ 
ness,  it  is  necessary  to  replace  the  local  data  structures 
with  a  shared  common  block.  A  dimension  is  added  to 
each  array  to  store  processor  specific  data.  Summing 
each  processor’s  contribution  is  then  done  outside  the 
parallel  region  and  protected  from  data  corruption.  Re¬ 
placing  local  data  storage  with  shared  storage  added  an 
execution  time  cost  of  10-25%  for  two  to  eight  proces¬ 
sors.  The  end  result  was  an  unchanged  execution  time. 
This  version  was  also  not  considered  further. 
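
The lock-free variant described above, in which each array gains a dimension indexed by processor and the sum is done outside the parallel region, might look like the following Python sketch. Threads again stand in for processors, and the function name and the particle-to-cell rule are assumptions for illustration.

```python
import threading


def particle_phase_no_locks(num_workers, particles_per_worker, num_cells):
    """Each worker writes only to its own row of a shared 2-D array;
    the rows are summed serially after all workers have finished."""
    # Extra leading dimension: one row of shared storage per worker.
    partial = [[0.0] * num_cells for _ in range(num_workers)]

    def worker(pid):
        row = partial[pid]           # no other worker writes to this row
        for k in range(particles_per_worker):
            # Hypothetical rule placing each particle in a grid cell.
            cell = (pid * particles_per_worker + k) % num_cells
            row[cell] += 1.0

    threads = [threading.Thread(target=worker, args=(pid,))
               for pid in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Serial reduction outside the parallel region, so no lock is needed.
    return [sum(partial[p][i] for p in range(num_workers))
            for i in range(num_cells)]


if __name__ == "__main__":
    print(particle_phase_no_locks(4, 100, 8))
```

The trade-off matches the one reported in the text: the locks disappear, but every worker now writes to shared storage and the reduction becomes a serial step.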

Overhead  to  set  up  the  parallel  region  in  the  code 
includes  the  cost  of  duplicating  private  common  blocks 
for  every  processor.  It  was  observed  that  reducing  the 
size  of  arrays  in  those  blocks  produced  small  increases 
in  efficiency.  Increasing  the  amount  of  work  being  done 
by  each  processor  by  increasing  the  number  of  particles 
also  improved  efficiency. 

Figure 5 compares performance of the BBN TC2000
with the shared memory Sequent Balance 21000 and the
distributed memory Intel iPSC hypercube for the 1-D
PIC code. Timing information for the 1-D shared memory
version on the Balance is reported by Campbell and
Sturtevant [8]. Results for the 1-D distributed memory
version on the iPSC are reported by Campbell [7]. The
test problem models a two-stream instability with 4096
particles in each of the two beams. The simulation executes
100 time steps over 320 grid cells. Test problems
run on the TC2000 used 32768 particles in each of the
two beams. The faster processors of the TC2000 required
a larger granularity for efficient performance.

Figure 6: Relative Efficiencies for the 1-D PIC Code
(relative efficiency versus number of processors)

The number of steps/second was calculated by dividing
the number of time steps by the total time,
in seconds, for the problem. TC2000 steps/second
were then multiplied by 65536/8192 (the ratio of the
total particle counts) to produce
steps/second/8192 particles.

The  shared  memory  Balance  demonstrates  the  best 
performance.  The  number  of  steps/sec  on  the  iPSC 
was  good  for  a  very  small  number  of  processors,  but 
decreased  rapidly  for  a  larger  number  of  processors. 
Replicating  the  grid  on  all  the  nodes  forced  global  com¬ 
munication  costs  to  be  prohibitive.  The  TC2000  per¬ 
formance  also  is  very  good  initially,  decreasing  as  the 
number  of  processors  increases. 

Relative efficiencies were calculated as (time for 1
processor) / (N x time for N processors). In the graph of
relative efficiencies shown in Figure 6, none of the machines
maintains a good relative efficiency. The Balance again
demonstrates better results than both the TC2000 and
iPSC.
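
The two measures used in this section can be written out as a short Python sketch. The function names and the timing values in the usage lines are hypothetical; only the formulas and the particle counts (2 x 32768 = 65536 on the TC2000, 2 x 4096 = 8192 on the other machines) come from the text.

```python
def steps_per_second(time_steps, total_seconds):
    """Number of time steps divided by total run time in seconds."""
    return time_steps / total_seconds


def normalize_steps(steps_per_sec, particles_used, reference_particles):
    """Rescale a rate measured with particles_used particles to the rate
    per reference_particles particles (a larger run counts for more)."""
    return steps_per_sec * particles_used / reference_particles


def relative_efficiency(time_one_proc, time_n_procs, n_procs):
    """Time on 1 processor divided by (N times the time on N processors)."""
    return time_one_proc / (n_procs * time_n_procs)


if __name__ == "__main__":
    # Hypothetical timings, chosen only to exercise the formulas.
    rate = steps_per_second(100, 50.0)
    print(normalize_steps(rate, 65536, 8192))
    print(relative_efficiency(40.0, 12.5, 4))
```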

Figure 7 compares performance of the BBN TC2000
with the shared memory Sequent Balance 21000 for the
2 1/2-D PIC code. Timing information for the 2 1/2-D
shared memory version on the Sequent is reported by
Campbell [6]. The test problem models a two-stream
instability with 2048 particles in each of the two beams.
The simulation executes 100 time steps over a grid of 10
x 70 cells. Test problems run on the TC2000 used 32768
particles in each of the two beams for better efficiency.

Figure 7: Steps/Second for the 2 1/2-D PIC Code

The number of steps/second was calculated by dividing
the number of time steps by the total time,
in seconds, for the problem. TC2000 steps/second
were then multiplied by 65536/4096 (the ratio of the
total particle counts) to produce
steps/second/4096 particles.

TC2000 performance in steps/second is better than
that of the Sequent. In the graph of relative efficiencies shown in
Figure 8, both machines demonstrate better efficiency
than in the 1-D case. This graph illustrates the importance
of relative efficiencies. The superior relative efficiency
of the Balance is not demonstrated in the graph
of steps/second.

Conclusions 

Initial performance of each PIC code on the BBN
TC2000 was somewhat below expected levels. Trivial
example problems produced nearly linear relative
speedup. One such problem is the program to calculate
π, originally used as a parallel test case by Babb [1]. A
more complex example is a grid-based algorithm written
by Dr. W. Jeffrey to solve a fluid-flow problem [4].
However, it is a very small problem (arrays with hundreds
of elements), compared to the PIC codes. BBN
suggests that it may help to pack code and data such
that each fits into a cache. The two PIC codes implemented
must be much larger than the cache size to solve
physically interesting problems.

The PIC algorithm implemented was based on a
shared-memory model and did not map well to the architecture
of the TC2000. The high costs of copying
very large blocks of read-only data and sharing very
large global data structures directly affected performance.
An algorithm based on a message-passing model
with its natural locality of data references may be an
effective solution. Communications between processors
could be kept to a minimum. More information could
be retained on each processor from one time step to
the next, decreasing requirements for global data structures.
Investigation of other models, such as message-passing,
is a subject for future research.


Figure 8: Relative Efficiencies for the 2 1/2-D PIC Code
(horizontal axis: number of processors)
Abstract

We have implemented a 2D electrostatic plasma particle in
cell (PIC) simulation code on the Caltech/JPL Mark IIIfp
Hypercube. The code simulates plasma effects by
evolving in time the trajectories of thousands to millions
of charged particles subject to their self-consistent fields.
Each particle's position and velocity is advanced in time
using a leapfrog method for integrating Newton's
equations of motion in electric and magnetic fields. The
electric field due to these moving charged particles is
calculated on a spatial grid at each time step by solving
Poisson's equation in Fourier space. These two tasks
represent the largest part of the computation. To obtain
efficient operation on a distributed memory parallel
computer, we are using the General Concurrent PIC
(GCPIC) algorithm [1] previously developed for a 1D
parallel PIC code.

Introduction 

In previous work we have demonstrated the efficiency of a
1D PIC code on the JPL/Caltech Mark III Hypercube [1].
We have now extended our work to a 2D implementation
of an electrostatic PIC code for plasma simulations, using
the General Concurrent PIC (GCPIC) algorithm [2]. The
GCPIC algorithm is a generalization of the techniques
employed in the 1D parallel PIC code which is applicable
to many different parallel architectures. In this paper we
describe its application to the implementation of the
well-benchmarked 2D electrostatic PIC code BEPS [3] on the
Mark III Hypercube.

A plasma PIC code simulates the self-consistent
interactions of thousands to millions of electrons and ions
in a computational box. There are two essential elements
to an electrostatic PIC code. The first is the particle push,
in which the positions and velocities of all of the particles
are advanced in time subject to any external magnetic field
and the self-consistent electric field, and their charges are
interpolated onto the field grid. The second is the field
solve, in which the electric field is updated based upon the
new particle positions. These two code sections represent
the vast majority of the computation. Additional
computation is required for diagnostics, which are done
periodically throughout the simulation but represent a
negligible fraction of the total computation time. Thus an
efficient implementation of a PIC code requires an
efficient implementation of the particle push and field
solve. Since the particle push represents the major
fraction of the computation time, it is essential on a
distributed memory machine to have approximately equal
numbers of particles in each processor. The field grid
must be distributed as well for the purpose of solving for
the new fields, and in a manner which is not necessarily
the same as that needed to push the particles. We refer to
these two decompositions as the primary (particle) and
secondary (field) decompositions.

Our  2D  PIC  code  is  periodic  in  one  dimension  and  may  be 
periodic  or  bounded  in  the  other  dimension.  As  the 
particles  are  advanced  in  time,  some  may  traverse  the 
entire  grid  space  during  the  course  of  the  simulation.  The 
simplest  primary  (particle)  decomposition  which  handles 
this  problem  is  the  static  decomposition,  in  which  each 
processor  keeps  a  copy  of  the  entire  field  grid  and  the 
particles  are  partitioned  at  the  beginning  of  the  simulation 
among  processors.  This  technique  guarantees  that  load 
balance  is  maintained  throughout  the  simulation,  at  the 
expense of redundant copies of the fields in every
processor.  Using  the  static  decomposition,  we  have 
obtained  efficiencies  for  the  push  in  excess  of  80%.  The 
major  inefficiency  of  this  method  results  from  the  need  to 
duplicate  the  charge  array  initialization  in  each  processor 
and  do  a  sum  over  processors  when  the  charge  array  is 
updated. 

The  next  level  of  sophistication  is  to  partition  the  field 
grid  as  well  as  particles,  so  that  each  processor  has  a 
unique  piece  of  the  simulation  space.  Because  of 
inhomogeneity  in  the  particle  density,  this  partition  may 
in  general  be  irregular  in  order  to  maintain  load  balance 
among  the  processors.  However,  there  is  a  large  class  of 
2D problems which has the property of being relatively
uniform  along  one  coordinate  direction,  especially  if  the 
problem  is  periodic  in  that  coordinate.  In  this  case,  a 
regular  decomposition  of  the  field  grid  among  processors 
along  the  coordinate  of  uniformity  (as  shown  in  Fig.  1) 
will  also  result  in  a  load  balanced  decomposition  of 
particles.  As  the  simulation  progresses,  some  particles 
will  traverse  the  entire  simulation  space.  Since  each 
processor  now  has  only  a  part  of  the  entire  field  grid,  it  is 
necessary  to  migrate  particles  from  one  processor  to 
another  as  they  evolve.  This  can  result  in  particle  load 
imbalance  if  the  net  flux  of  particles  out  of  each  processor 
is  not  zero.  A  particle  load  imbalance  could  develop 
during  the  course  of  the  simulation,  even  though  there  is 
perfect  load  balance  to  begin  with.  Fortunately,  for  the 
class  of  problems  for  which  this  decomposition  is 
appropriate,  significant  load  imbalance  does  not  develop 
due  to  the  physics. 


Figure 1. A regular particle partition for 4 processors.
The y direction is the coordinate of relative
uniformity in this case. Space is subdivided evenly
among processors, leading to a load balanced partition
of particles as well.

If  the  regular  decomposition  cannot  be  used  effectively 
(some  device  physics  problems  can  have  large 
nonuniformities  along  all  coordinate  directions),  a  free 
form  decomposition  of  the  field  grid  may  be  necessary. 
These  pieces  can  be  of  different  size  in  general,  since 
nonuniformities  may  develop  during  the  course  of  a 
simulation.  To  maintain  load  balance,  the  distribution  of 
particles  and  field  grid  must  also  evolve  during  the  course 
of  the  simulation.  We  are  in  the  process  of  implementing 
the same algorithm for dynamic load balancing as has
been used for the 1D PIC code [4]. The grid space is
partitioned as shown in Fig. 2 into slices so that each
processor handles all of the x domain for a particular range
of the y domain. Particles migrate between processors as
they traverse the computational space. The grid space
may be repartitioned as the density in the simulation
evolves so that load balance is maintained.


Figure 2. Field grid partition based on particle
density distribution. Load balance requires that
particles be distributed evenly among the processors.
Thus each processor may have a different number of
grid points. (Panel labels: grid axes x and y; strips
Proc 1 to Proc 4 along y; density profile given by the
integral of n(x,y) over x.)


The secondary (field) decomposition is made to update the
field values at each time step. We calculate the new fields
by solving Poisson's equation in Fourier space. For best
performance in parallel, we compute the 2D FFT as two
sets of 1D FFTs along each coordinate direction. For this
solution method, the decomposition is a straightforward
assignment of slices of the grid along one coordinate
direction to each processor, as shown in Fig. 3. The
FFTs in the coordinate direction parallel to the long edge
of the slice are performed. Then the grid is repartitioned
into slices along the other coordinate direction, so that the
second set of FFTs may be done.

Figure 3. Field grid decomposition for the 2D FFT.
Each processor has a strip of the field grid initially,
such that it can do 1D FFTs in x for its subset of the
y dimension. The results are then redistributed so
that each processor now has a strip oriented along the
y coordinate direction. 1D FFTs in y may now be
performed for each processor's subset of kx.
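
The two-pass structure of the transform can be sketched serially in Python. This is an illustration under stated assumptions, not the BEPS solver: a naive O(n^2) discrete Fourier transform stands in for the 1D FFT, a transpose stands in for the redistribution of strips between processors, and all function names are invented for the example.

```python
import cmath


def dft_1d(values):
    """Naive 1D discrete Fourier transform (stand-in for a 1D FFT)."""
    n = len(values)
    return [sum(values[m] * cmath.exp(-2j * cmath.pi * m * k / n)
                for m in range(n))
            for k in range(n)]


def transpose(grid):
    """Swap the two grid indices; models the redistribution of strips."""
    return [list(column) for column in zip(*grid)]


def dft_2d_by_strips(grid):
    """2D transform of grid[y][x] done as two sets of 1D transforms."""
    # First set: each row (fixed y) is transformed in x. In parallel,
    # each processor would do this for its own range of y.
    after_x = [dft_1d(row) for row in grid]
    # Redistribution: regroup so each strip holds all y for one kx.
    by_kx = transpose(after_x)
    # Second set: transform in y for each kx.
    after_y = [dft_1d(column) for column in by_kx]
    # Return with indices ordered [ky][kx].
    return transpose(after_y)


if __name__ == "__main__":
    result = dft_2d_by_strips([[1.0, 2.0, 0.0, -1.0], [0.5, 0.0, 3.0, 1.0]])
    print(result[0][0])
```

In the parallel code the transpose is the only step that moves data between processors, which is why it dominates the solver cost as the number of processors grows.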

Diagnostics  are  done  in  parallel,  including  graphics,  by 
using  one  of  the  field  decompositions  described  above. 
Phase  space  plots,  for  example,  are  parallelized  using  the 
primary  decomposition,  while  contour  plots  of  potential 
are  done  using  the  secondary  field  decomposition.  The 
graphics  software  operates  in  parallel,  with  each  processor 
drawing  a  separate  portion  of  the  graph  corresponding  to 
its  part  of  the  diagnostic. 

Code  Operation  with  the  Regular  Particle 
Decomposition 

The  main  loop  of  the  2D  code  proceeds  as  follows.  The 
field  solver  takes  the  real  space  charge  distribution  which 
has  been  interpolated  onto  the  field  grid  and  transforms  it 
into  k  space  using  the  2D  FFT  algorithm  mentioned 
above.  Poisson's  equation  is  solved  in  k  space,  and  the  x 
and  y  components  of  the  electric  field  are  computed  from 
its  solution.  Then  the  two  electric  field  components  are 
transformed  back  to  real  space.  Since  the  x  space  field 
grid  decomposition  is  the  same  as  the  particle 
decomposition when using the regular grid primary
decomposition, no additional grid rearrangement is required
to  begin  the  push.  However,  interpolating  the  field  from 
the  grid  for  all  of  the  particles  in  the  regular 
decomposition  requires  guard  rows  on  both  sides  of  the 
grid,  since  particles  at  a  decomposition  boundary  require 
field  information  which  is  contained  in  a  neighboring 
processor.  This  guard  row  information  is  exchanged 
between  processor  neighbors  before  the  push  phase 
begins.  By  mapping  the  processors  into  a  logical  ring, 
only  nearest  neighbor  communication  is  required  for  the 
exchanges.  The  push  phase  of  the  simulation  involves 
advancing  the  particles'  positions  and  velocities  one  time 
step,  then  interpolating  each  particle's  charge  back  onto 
the  field  grid  using  its  updated  position.  Since  some 
particle  charge  will  be  interpolated  onto  the  guard  rows, 
these  rows  must  be  combined  with  their  counterparts  in 
adjacent  processors  before  the  charge  deposition  is 
complete.  Again,  only  nearest  neighbor  communication 
is  required. 
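
The two guard row exchanges can be illustrated with a serial Python sketch in which a list of strips stands in for the processors on the logical ring. The data layout (one guard row below and one above the interior rows of each strip) and the function names are assumptions made for the example; the actual code performs these steps with nearest neighbor messages.

```python
def fill_guard_rows(strips):
    """Before the push: copy edge field rows from ring neighbors.

    strips[p] is a list of rows: [low guard, interior rows..., high guard].
    """
    nproc = len(strips)
    for p in range(nproc):
        left = strips[(p - 1) % nproc]
        right = strips[(p + 1) % nproc]
        strips[p][0] = list(left[-2])    # neighbor's last interior row
        strips[p][-1] = list(right[1])   # neighbor's first interior row


def fold_guard_rows(strips):
    """After charge deposition: add guard row charge into the neighbor's
    matching interior row, then clear the guard rows."""
    nproc = len(strips)
    for p in range(nproc):
        left = strips[(p - 1) % nproc]
        right = strips[(p + 1) % nproc]
        for i, value in enumerate(strips[p][0]):
            left[-2][i] += value
        for i, value in enumerate(strips[p][-1]):
            right[1][i] += value
    for strip in strips:
        strip[0] = [0.0] * len(strip[0])
        strip[-1] = [0.0] * len(strip[-1])


if __name__ == "__main__":
    strips = [[[0.0, 0.0], [p * 10 + 1.0] * 2, [p * 10 + 2.0] * 2, [0.0, 0.0]]
              for p in range(3)]
    fill_guard_rows(strips)
    print(strips[0])
```

The modular index arithmetic gives the ring its periodic wraparound, so the first and last strips are neighbors just as the first and last processors are.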

Results 

In Table 1, we present timings for the two major code
sections which are executed at each time step of a
simulation run. Two test problems of different size were
timed. In each test case, the physics problem being
modeled was the same (a lower hybrid plasma wave
traveling along the periodic coordinate was excited by an
antenna). The push phase in each case shows practically
linear speedup as the number of processors is increased.
The solver phase, which is dominated by three 2D FFTs,
rapidly saturates in speedup. This is caused by an increase
in the amount of communication required by the grid
redistribution between 1D FFTs while the number of 1D
FFTs done in each processor decreases. This is a problem
with FFT based solvers in general, since information from
each grid point must ultimately be combined with
information from every other grid point in order to
compute the transform.

Timings for Critical Code Sections
Mark IIIfp Hypercube

32 x 128 Field Grid, 16,128 Particles

Number of Processors     1        2        4        8        16       32
Solver (sec)             .742     .427     .275     .205     .183     Note 1
Push (sec)               1.74     .861     .421     .207     .108     Note 1
per particle (μsec)      107.9    53.6     26.1     12.8     6.7

64 x 256 Field Grid, 235,136 Particles

Number of Processors     1        2        4        8        16       32
Solver (sec)             Note 2   1.70     .996     .652     .498     .449
Push (sec)               Note 2   13.2     6.66     3.34     1.68     .849
per particle (μsec)      Note 2   56.1     28.3     14.2     7.1      3.6

Note 1 - The FFT requires that nx, the number of grid points in the x direction, be at least twice the number of
processors.

Note 2 - The problem was too large to fit on one processor alone.

Table 1. Measured times for the two main code sections in BEPS. Solver and Push times are elapsed times,
including communication. Per particle time is push time divided by the number of particles.
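
As a worked check, the Table 1 timings for the smaller test case can be converted into parallel efficiencies, taking efficiency as the one-processor time divided by N times the N-processor time. The Python sketch below uses only the tabulated values; the variable and function names are chosen for the example.

```python
# Timings in seconds from Table 1 (32 x 128 grid, 16,128 particles).
PROCESSORS = [1, 2, 4, 8, 16]
SOLVER_SEC = [0.742, 0.427, 0.275, 0.205, 0.183]
PUSH_SEC = [1.74, 0.861, 0.421, 0.207, 0.108]


def efficiency(time_one_proc, time_n_procs, n_procs):
    """One-processor time divided by (N times the N-processor time)."""
    return time_one_proc / (n_procs * time_n_procs)


def efficiencies(times):
    """Efficiency at each processor count, relative to the first entry."""
    return [efficiency(times[0], t, n) for n, t in zip(PROCESSORS, times)]


if __name__ == "__main__":
    for n, solver, push in zip(PROCESSORS,
                               efficiencies(SOLVER_SEC),
                               efficiencies(PUSH_SEC)):
        print(f"N={n:2d}  solver {solver:6.1%}  push {push:6.1%}")
```

On these numbers the push stays slightly above 100% at every processor count, while the solver falls to roughly a quarter at 16 processors.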

In Fig. 4 we have plotted the efficiency of each code
section for the two test cases as a function of the number
of processors employed. We define efficiency E as

E = T_1 / (N T_N)

where N is the number of processors, T_N is the
execution time on N processors, and T_1 is the execution
time on 1 processor. The solver efficiency drops
dramatically as the number of processors is increased, due
to the increasing communication to computation ratio
mentioned above. The push efficiency remains very near
100%, independent of the number of processors. This
demonstrates that the communication time required for
migrating particles between processors and exchanging
guard row information is negligible compared to the
computation involved in updating the particle positions
and velocities. The efficiencies in excess of 100%
achieved for the smaller test case by the push phase
simply indicate that the algorithm being used in the push
is not optimal for one processor. The Mark IIIfp has cache
memory associated with the Weitek Floating Point
Processors. As more processors are used, the number of
particles and the size of the field grid each processor
handles decreases, resulting in a lower probability of cache
misses. The increase in performance of the hardware when
using the cache memory more than makes up for the
addition of communication overhead. The larger test case
never gets subdivided sufficiently for this hardware effect
to be noticed.

Figure 4. BEPS performance on the Mark IIIfp: measured
code efficiencies. The push
section of the code always runs close to 100%. The
solver, which is dominated by 2D FFTs, suffers rapid
efficiency degradation as the number of processors is
increased.

Since the primary (particle) decomposition remains fixed
throughout the simulation, the possibility of particle load
imbalance exists. In Fig. 5 we plot the percentage of load
imbalance (%LI) observed in the smaller test case running
on 16 processors. The physics of the problem was
changed from a heating simulation to a current drive
simulation, where particles are accelerated along the
periodic coordinate. The load imbalance is defined as

%LI = (n_{max} - n_{ave}) / n_{ave}

where n_{max} is the maximum number of particles in any
processor and n_{ave} is the average number of particles per
processor. Even though particles are moving (rather
rapidly) between processors, the largest load imbalance
observed during the first 100 time steps is about 6.2%.
The load imbalance continues to oscillate around 3.5% for
the rest of the simulation. This is clearly a simulation
from the class where the fixed particle partition works
very well. We believe, however, that the performance of
the fixed particle partition on this problem is
representative. Since, from a physics standpoint, it is
quite difficult to develop and maintain large
inhomogeneities in all coordinate directions in a plasma
simulation, we also believe that the fixed particle partition
is applicable to a wide variety of problems of interest to
the fusion plasma and space plasma communities.

Figure 5. BEPS measured load imbalance with the
regular particle partition. The imbalance is defined as
the largest percentage deviation of any processor's
particle load from ideal at a given time step.
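
The load imbalance measure is simple to compute from the per-processor particle counts. A minimal Python sketch follows; the function name and the sample counts are hypothetical, and the result is expressed as a percentage.

```python
def load_imbalance_percent(particle_counts):
    """Percentage by which the fullest processor exceeds the average load."""
    n_ave = sum(particle_counts) / len(particle_counts)
    n_max = max(particle_counts)
    return 100.0 * (n_max - n_ave) / n_ave


if __name__ == "__main__":
    # Hypothetical particle counts on four processors.
    print(load_imbalance_percent([110, 90, 100, 100]))
```

Only the most heavily loaded processor matters for this measure, since the push phase cannot finish until that processor has advanced all of its particles.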

Dynamic  Load  Balancing 

Of course not all simulation problems of interest are
amenable to the fixed particle partitioning scheme. For
these problems, some kind of irregular partition is
necessary, and with it, the ability to dynamically balance
the particle load among the processors. Fig. 6 illustrates
a load balancing scheme which does not require particle
sorting, per se. Assume that an irregular partition already
exists which is load balanced. After the particles are
advanced in time and passed among processors, some load
imbalance may have developed. Rather than sorting the
particles by coordinate to determine the new (load
balanced) partition, the particles are interpolated onto the
charge grid in the current partition. Before the field solve
proceeds, the charge density is used to determine the new
partition positions. The actual method of determining the
new partition locations is not important, since it will
scale with the grid size, rather than the number of
particles. A parallel recursive bisection on the charge
density appears to be an attractive choice. We are in the
process of implementing a dynamic load balancing scheme
for the 2D code.

Figure 6. Dynamic load balancing without particle
sorting. The charge density interpolated onto the grid
is used to construct a density function. Partitioning
is done based on this function.
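
One way to picture the repartitioning step is the Python sketch below. It is not the recursive bisection mentioned above: as a simpler stand-in, it makes a single serial pass over the running sum of a density profile (the charge density already summed over x for each grid row in y) and places the cuts where the running sum crosses equal fractions of the total. The function name and the sample profiles are assumptions for illustration.

```python
from itertools import accumulate


def balanced_cuts(density_per_row, nproc):
    """Return nproc + 1 row indices; processor p owns rows
    cuts[p] up to (not including) cuts[p + 1]."""
    cumulative = list(accumulate(density_per_row))
    total = cumulative[-1]
    cuts = [0]
    row = 0
    for p in range(1, nproc):
        target = total * p / nproc
        # Advance to the first row whose running sum reaches the target.
        while row < len(cumulative) and cumulative[row] < target:
            row += 1
        cuts.append(min(row + 1, len(density_per_row)))
    cuts.append(len(density_per_row))
    return cuts


if __name__ == "__main__":
    # Hypothetical density profile with a dense region at each end.
    profile = [4, 1, 1, 1, 1, 1, 1, 2]
    cuts = balanced_cuts(profile, 3)
    print(cuts, [sum(profile[a:b]) for a, b in zip(cuts, cuts[1:])])
```

The work is proportional to the number of grid rows, not the number of particles, which is the property the text relies on. With a strongly peaked density, some processors may receive very few rows.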

Conclusions 

We  have  implemented  a  2D  electrostatic  PIC  code  for 
plasma  simulation  on  the  Mark  Illfp  Hypercube 
Concurrent  Computer.  The  code  is  completely 
parallelized,  including  diagnostics  and  graphics.  We  are 
currently  using  a  regular  primary  (particle)  partition, 
which  is  fixed  throughout  the  entire  simulation  run.  This 
decomposition exhibits very good particle load balance for
a  large  class  of  plasma  problems.  Particle  push 
efficiencies  remain  close  to  100%  with  up  to  32 
processors.  Solver  performance,  which  is  based  upon 
FFT  performance,  degrades  rapidly  as  the  number  of 
processors  is  increased. 
Abstract

We present a systematic study of the applicability
of massively parallel computers, the AMT
DAP-510/610 and the TMC CM-2, to the solution
of the two-dimensional unsteady Euler equations
using a compact high-order scheme. The
performance of these machines is compared to
that of the Cray-2 and the Cray-YMP/832 using
the same algorithm and for the same test problem.


Introduction 

A major computational challenge is to calculate
time-accurate solutions to the continuum equations for the
unsteady flow of a compressible fluid in two and three
dimensions for very large problems in a reasonable time.
Computational algorithms have been developed for vector
computers because, until recently, they were the only
computing engines capable of providing at least a
proportion of the necessary computing power. Very little
work seems to have been done to develop algorithms for
compressible flow calculations on massively parallel SIMD
computers. The major problem in using such massively
parallel computers is to develop fine-grained parallel
algorithms, which requires an understanding of the data
communication and synchronization requirements of these
algorithms and the development of efficient techniques to
map the computational domain onto the set of processors.

Recently, Agarwal and Richardson [2] developed an Euler
code for the TMC Connection Machine. They used a
finite-volume discretization scheme coupled with a
fourth-order Runge-Kutta integrator to advance the
solution in time. This scheme is second-order accurate in
space. In addition, since shocks can develop within the
flow field, the finite-volume method is slightly altered
to allow an artificial dissipation term, which is itself
second-order. The authors discussed the implementation of
this scheme and presented results for a specific problem.
Among these results, they showed that for large problems
the CM-2 was faster than the Cray XMP/18.

In this paper we present a systematic study of the
applicability of massively parallel computers, the AMT
DAP-510/610 and the CM-2, to the solution of the
two-dimensional unsteady Euler equations using a compact
high-order scheme. The performance of these machines is
then compared to that of the Cray-2 and the Cray-YMP/832
using a vector form of the same algorithm and the same
test problem.

Formulation 


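The two-dimensional Euler equations are

$$ \vec{u}_t + \vec{f}_x + \vec{g}_y = 0 \tag{1} $$

with

$$ \vec{u} = \begin{pmatrix} \rho \\ \rho u \\ \rho v \\ E \end{pmatrix} \tag{2} $$

$$ \vec{f} = \begin{pmatrix} \rho u \\ \rho u^2 + p \\ \rho u v \\ u (E + p) \end{pmatrix} \tag{3} $$

$$ \vec{g} = \begin{pmatrix} \rho v \\ \rho v u \\ \rho v^2 + p \\ v (E + p) \end{pmatrix} \tag{4} $$
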


where $\rho$, $u$, $v$, $E$, and $p$ are respectively the density,
the velocity components in the x and y directions, the
total energy per unit volume and the pressure. In
addition, the equation of state is taken to be that
of a perfect gas and is given by

$$ E = \frac{p}{\gamma - 1} + \frac{\rho (u^2 + v^2)}{2} \tag{5} $$

Here $\gamma$ is the ratio of specific heats, taken to be
1.4.
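
As an illustration only, the equation of state (5) and the flux
vectors (3) and (4) can be evaluated from the conserved variables as
in the following Python sketch. The function names are hypothetical
and are not taken from the Fortran Plus or CM-Fortran codes described
in this paper; the sketch works on a single grid point.

```python
GAMMA = 1.4  # ratio of specific heats, as stated in the text


def total_energy(rho, u, v, p):
    """Eq. (5): total energy per unit volume of a perfect gas."""
    return p / (GAMMA - 1.0) + 0.5 * rho * (u * u + v * v)


def pressure(rho, rho_u, rho_v, E):
    """Invert Eq. (5) to recover the pressure from the conserved variables."""
    u, v = rho_u / rho, rho_v / rho
    return (GAMMA - 1.0) * (E - 0.5 * rho * (u * u + v * v))


def fluxes(rho, rho_u, rho_v, E):
    """Return the flux vectors f (Eq. 3) and g (Eq. 4) as 4-tuples."""
    u, v = rho_u / rho, rho_v / rho
    p = pressure(rho, rho_u, rho_v, E)
    f = (rho_u, rho_u * u + p, rho_u * v, u * (E + p))
    g = (rho_v, rho_v * u, rho_v * v + p, v * (E + p))
    return f, g
```

Note that the third component of f and the second component of g are
the same quantity, the momentum flux $\rho u v$.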

We are interested in simulating the evolution of
supersonic and hypersonic mixing layers. In order to
obtain accurate results in such a simulation one must
control the dispersive errors and, for centered schemes
[1], these are mostly of third order, which suggests the
need for a scheme of higher than second order. We
therefore chose the compact four-step fourth-order
Runge-Kutta scheme of Abarbanel and Kumar [1]. This
explicit algorithm has both fourth-order spatial and
temporal accuracy as well as efficiency and ease of
implementation. The Abarbanel and Kumar algorithm takes
the form (where we drop the arrows indicating a vector):


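$$
\begin{aligned}
u^{(0)} &= u^{n} \\
u^{(1)} &= u^{n} - \tfrac{1}{4}\,\Delta t\, R(u^{(0)}) \\
u^{(2)} &= u^{n} - \tfrac{1}{3}\,\Delta t\, R(u^{(1)}) \\
u^{(3)} &= u^{n} - \tfrac{1}{2}\,\Delta t\, R(u^{(2)}) \\
u^{(4)} &= u^{n} - \Delta t\, R(u^{(3)}) \\
u^{n+1} &= u^{(4)}
\end{aligned}
$$

where R is the compact fourth-order spatial operator of
[1], with $\delta$ the centered difference operator and $\mu$
the averaging operator. When shocks are present
within the flow field the residual R must be modified
to include an explicit artificial viscosity term,
which can be of second or fourth order [1].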
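
The structure of the four-step time integration can be sketched in
Python as follows. This is a minimal illustration, assuming stage
coefficients of 1/4, 1/3, 1/2 and 1 (the choice that makes a scheme of
this form fourth-order accurate for linear problems). The residual
used in the example, R(u) = u, is a hypothetical stand-in and is not
the compact spatial operator of [1].

```python
import math


def rk4_step(u, dt, residual):
    """One time step of du/dt = -R(u).

    Every stage restarts from the old solution u:
        stage_k = u - a_k * dt * R(stage_{k-1}),  a_k = 1/4, 1/3, 1/2, 1
    so only the old solution and one work array are needed.
    """
    stage = list(u)
    for a in (1.0 / 4.0, 1.0 / 3.0, 1.0 / 2.0, 1.0):
        r = residual(stage)
        stage = [un - a * dt * rk for un, rk in zip(u, r)]
    return stage


def linear_residual(u):
    """Hypothetical test residual R(u) = u, i.e. du/dt = -u."""
    return list(u)


# For du/dt = -u the exact solution after one step is exp(-dt).
dt = 0.1
approx = rk4_step([1.0], dt, linear_residual)[0]
error = abs(approx - math.exp(-dt))  # roughly dt**5 / 120
```

For this linear test the four stages reproduce the Taylor series of
exp(-dt) through the dt**4 term, which is what fourth-order temporal
accuracy means here.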

Boundary conditions at solid walls are imposed by placing
the wall between a pair of adjacent grid points. For
scalar variables, such as density and pressure, the
boundary condition is that the gradient normal to the
boundary is zero at the boundary. Thus the values of the
"ghost" variables outside the flow domain are set equal to
the values of the corresponding variables inside the
boundary. For vector variables, the component tangential
to the boundary is treated in the same way as the scalar
variables. The boundary condition for the component normal
to the boundary, however, is that the component is zero at
the boundary. This requires that the corresponding "ghost"
values be the negatives of the values just within the
boundary. Application of these boundary conditions
requires special care in the region surrounding a convex
corner. The test problems which we report on here are both
supersonic. Therefore all values of the variables are set
at inflow boundaries (of course these must be consistent)
and no values can be set at outflow boundaries. This
requires that one use extrapolation at outflow boundaries.
More general boundary conditions are briefly discussed
below.
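
The solid-wall treatment described above amounts to mirroring scalars
and the tangential velocity into the ghost points and negating the
normal velocity. The following Python sketch shows this for a
horizontal wall. The array layout (row 0 as the ghost row, row 1 as
the first interior row) and the function name are assumptions made for
illustration only.

```python
def apply_wall_ghosts(rho, p, u, v):
    """Fill ghost row 0 for a horizontal wall lying between rows 0 and 1.

    Each field is a list of rows. Row 0 lies outside the flow domain and
    row 1 is the first row inside it. u is the velocity component
    tangential to the wall and v is the component normal to it.
    """
    # Zero normal gradient: the ghost value equals the interior value.
    for field in (rho, p, u):
        field[0] = list(field[1])
    # Zero normal velocity at the wall: the ghost value is the negative
    # of the interior value, so the two average to zero on the wall.
    v[0] = [-value for value in v[1]]
```

Because the wall sits midway between the two rows, the average of the
ghost and interior values of v is the wall value, which this
construction makes exactly zero.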


Implementation 

We have implemented this algorithm on a DAP-510 (at Old
Dominion Univ.) and a DAP-610 (at Univ. of Cambridge) and
on a CM-2 (at Argonne Natl. Lab). The DAP-510 (610)
consists of 1024 (4096) single-bit processors arranged in
a 32 x 32 (64 x 64) array. Each processor is connected to
its four nearest neighbors. In addition, a bus system
connects all the processors in each row and all the
processors in each column. Each processor has a local
memory of 64 Kbits.

The CM-2 can have up to 64K physical processors. These
are 1-bit processors, each with 64K bits of local memory.
In addition, the CM-2 has one floating point processor for
each set of 32 CM processors. The CM parallel instruction
set provides a virtual processor facility that allows each
physical processor to simulate some number of virtual
processors. To transfer data among virtual processors, the
instruction set supports two interprocessor communication
mechanisms: general communication and gridwise
communication.

The DAP and CM-2 codes were written in Fortran Plus and
CM-Fortran, respectively. In the case of the CM-2, we used
gridwise communication to transfer data among virtual
processors.

The two-dimensional space of the problem was divided into
an equi-spaced grid. Each point of the grid is mapped onto
one processor of the machine if the number of grid points
is less than or equal to the number of processors. For a
grid size exceeding the number of physical processors,
each physical processor simulates several virtual
processors, so long as the memory size is not exceeded.
Whenever an obstacle is present in the path of the flow,
the processors corresponding to the grid points within the
obstacle are masked. That is, the values are not stored at
those processors.

Figure 1: Shock reflection from a flat plate.

The implementation of this algorithm to compute the
values of each component of u at a processor consists of
four similar steps, one for each step in the Runge-Kutta
algorithm. Each step is in turn composed mainly of two
sub-steps, namely (i) acquire, and (ii) combine. In the
acquire step the values of elements of f and g from eight
neighboring processors are fetched. Four of these eight
processors are directly connected to the processor
concerned and hence values from those are fetched in a
single step. The values from the other four processors are
acquired in two steps. The eight values fetched in the
acquire step are then combined to evaluate R. The
artificial viscosity term is similarly computed.
Subsequently u is computed in the same way. The
computation required in the acquire and combine steps is
done in parallel by all of the processors in the array.
The boundary conditions are handled by merging rows and/or
columns of data inside and outside the boundary. This is
an efficient operation so that there is little
deterioration in the speed of computation.
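
The acquire sub-step can be pictured as a set of whole-array shifts:
one nearest-neighbor transfer for each of the four directly connected
processors, and two successive transfers for each diagonal neighbor.
The Python sketch below imitates this on an ordinary list of lists.
The function names are hypothetical, and the periodic wrap-around at
the edges is an assumption made only to keep the example short; it is
not the boundary treatment used in the codes described here.

```python
def shift(grid, di, dj):
    """Return a grid whose point (i, j) holds the value at (i+di, j+dj).

    This mimics every processor fetching a value from the neighbor at
    the same relative offset, all at once. Edges wrap around.
    """
    n, m = len(grid), len(grid[0])
    return [[grid[(i + di) % n][(j + dj) % m] for j in range(m)]
            for i in range(n)]


def acquire(grid):
    """Gather the eight neighbor values of every grid point."""
    north, south = shift(grid, -1, 0), shift(grid, 1, 0)
    west, east = shift(grid, 0, -1), shift(grid, 0, 1)
    return {
        # directly connected neighbors: one transfer each
        "N": north, "S": south, "W": west, "E": east,
        # diagonal neighbors: a second transfer applied to a shifted copy
        "NW": shift(north, 0, -1), "NE": shift(north, 0, 1),
        "SW": shift(south, 0, -1), "SE": shift(south, 0, 1),
    }
```

The combine sub-step would then form R at every point from these eight
arrays with purely local arithmetic.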

Flow  Simulations 

We  report  results  for  two  examples:  (i)  a  shock 
reflecting  from  a  flat  plate,  and  (ii)  a  Mach  3  flow 
into  a  channel  with  a  forward  facing  step.  Both 
examples  are  supersonic  in  character  with  shocks 
present,  with  the  second  example  having  a  small 
recirculation  region  in  front  of  the  step. 


The geometry of problem (i) is shown in figure 1. The
solid lines show the position of the shocks. The flow at
the inflow, below the shock at x = 0.0, is parallel to the
plate and has a Mach number of 1.95. The flow in the
region above the shocks has a Mach number of 1.7736 and is
inclined at an angle of 5 degrees towards the plate. This
problem has an exact solution; for further details see
[1], where it was also used as a test problem. This test
problem was run on a DAP-510 using 64 grid points in the x
direction and 32 grid points in the y direction. The
steady state pressure distributions along the sections
(a-a) and (b-b) shown in figure 1 are plotted in figures 2
and 3.

The pressure distributions show that the shocks are two
to three grid points thick and also that there are small
oscillations near the shocks. The solution in the region
to the right of the reflected shock is that given by
theory. These results are very similar to those of
Abarbanel and Kumar [1].

Results for the second test problem are shown in figures
4, 5, 6 and 7, where we plot contours of the density in
the channel flow at four different dimensionless times,
t = 0.25, 0.5, 1.0, and 2.0. Here the length scale, L, is
the height of the channel at the inflow boundary and the
velocity scale is c, the sound speed of the incoming gas.
Thus the time scale is L/c. The impulsive start of a Mach
3 flow into a channel with a forward facing step is a
severe test of the robustness of a numerical algorithm and
has been used by Lohner, Morgan, and Zienkiewicz [6], and
Glaister [5], for example, as a test problem.

Figure 3: Pressure along section (b-b).

The results shown here were obtained on a DAP-510. The
grid for these runs was 128 x 32 points. At t = 0.0 a
highly curved bow shock is generated at the step and
begins to propagate upstream and over the step. At
t = 0.25 (figure 4) the shock is curved around the step
and there is a region of high density and slow circulation
just ahead of the step. Notice that the shock has a
thickness of about 2 to 3 grid points. By the time
t = 0.50 (figure 5) the shock has moved slightly upstream
of the step but has not yet hit the top wall of the
channel. The shapes of the density contours in figures 4
and 5 are similar, as would be expected. Figure 6 shows the
density contours at t = 1.0, after the shock has undergone
a reflection from the top wall. The bow shock thickness is
essentially unchanged but now there is a reflected shock
moving down from the upper wall. Finally, in figure 7, at
t = 2.0 one can see that the shock generated by reflection
at the upper wall has just begun to reflect again from the
lower wall. The results of these test problems suggest
that the Abarbanel-Kumar algorithm is both accurate and
robust and can be used with confidence.

Performance 

Here we report the performance results for the DAP-510
and 610 as well as for the CM-2, on which we used only 8K
processors, the maximum number available to us. We compare
our timings for these machines with those obtained using a
Cray-2 and a Cray-YMP/832 to solve problem (ii). We have
run the channel problem with a variety of grid sizes
ranging from 96 x 32 points to 256 x 64 points on the
DAP-510 and 610, on the CM-2, on a Cray-2, and on a
Cray-YMP/832. Timing results are given in table 1 in terms
of (a) the number of seconds required to compute one full
time step for each of these grid sizes on each of these
computers, (b) the relative speed of computation using a
256 x 64 grid, and (c) the processing rate.


Machine     Problem Size   Time (sec.)   Rel. Speed   Rate (Mflops)
CM-2        96 x 32        0.428         -            3.5
CM-2        256 x 64       0.474         0.56         17
DAP-510     96 x 32        0.200         -            7.4
DAP-510     128 x 32       0.263         -            7.5
DAP-510     256 x 32       0.519         -            7.6
DAP-610     256 x 64       0.263         1.00         30
Cray-2      256 x 64       0.113         2.32         70
Cray-YMP    256 x 64       0.0617        4.26         128

Table 1: Timings and Relative Speed

From table 1 one can see that, first, there must be a
major bottleneck, independent of problem size, in our CM-2
implementation. Increasing the problem size by a factor of
5.3 results in an increase in the execution time per time
step of only 11% and a consequent increase in the
processing rate from 3.5 Mflops to 17 Mflops. We have not
been able to find this bottleneck, perhaps because of our
relative lack of experience with the CM-2. One of our
major goals is to determine the cause of this bottleneck
in order to see whether or not it is intrinsic to the
algorithm/architecture combination.
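
The figures quoted from table 1 can be cross-checked with a few lines
of arithmetic. The Python sketch below uses only timings copied from
the table. It additionally assumes that the operation count per time
step is proportional to the number of grid points, with the constant
fixed by the DAP-610 entry (30 Mflops at 0.263 s); that assumption is
ours and is used only to show that the tabulated rates are mutually
consistent.

```python
# (machine, grid) -> seconds per time step, copied from table 1
TIMES = {
    ("CM-2", (96, 32)): 0.428,
    ("CM-2", (256, 64)): 0.474,
    ("DAP-510", (96, 32)): 0.200,
    ("DAP-610", (256, 64)): 0.263,
    ("Cray-2", (256, 64)): 0.113,
    ("Cray-YMP", (256, 64)): 0.0617,
}


def points(grid):
    """Number of grid points in an nx x ny grid."""
    return grid[0] * grid[1]


# CM-2: growth in problem size versus growth in time per step
size_ratio = points((256, 64)) / points((96, 32))
time_ratio = TIMES[("CM-2", (256, 64))] / TIMES[("CM-2", (96, 32))]


def relative_speed(machine):
    """Speed on the 256 x 64 grid relative to the DAP-610."""
    return TIMES[("DAP-610", (256, 64))] / TIMES[(machine, (256, 64))]


# Operations per grid point per step implied by the DAP-610 entry
FLOPS_PER_POINT = 30e6 * TIMES[("DAP-610", (256, 64))] / points((256, 64))


def implied_rate_mflops(machine, grid):
    """Processing rate implied by the assumed per-point operation count."""
    return FLOPS_PER_POINT * points(grid) / TIMES[(machine, grid)] / 1e6
```

Under this assumption the implied rates for the CM-2 and the DAP-510
on the 96 x 32 grid come out close to the tabulated 3.5 and 7.4
Mflops.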

In contrast, the DAP implementation has been very
successful, perhaps because one of us (CEG) has had
extensive experience in programming this, and previous,
models of the DAP. The processing times and rates scale
very nearly linearly with problem size and are very close
to optimum; that is, these rates are about 95% of the
maximum one would calculate by counting the number of
floating point operations and computing the run time by
multiplying the number of operations by the time per
operation. The rates given in table 1 do not include the
cost of the run time graphics, where we output color
contours of the density, pressure and velocity components
to the monitor every time step. The cost of the run time
graphics is less than 6% of that of the time step.

Figure 4: Density contours at t = 0.25.

Figure 5: Density contours at t = 0.50.

Figure 7: Density contours at t = 2.00.

The processing rates for the Cray machines, which are
substantially less than the theoretical peak processing
rates, are not very surprising in view of the results of
our previous studies [3,4]. It seems that these rather low
relative processing rates are entirely due to inadequate
bandwidth to memory and memory bank conflicts. This is a
clear example of the "von Neumann bottleneck". It seems
that massively parallel processors with a substantial
local memory per processor do not experience this
bottleneck.

Future  Work 

One of our major goals, as mentioned above, is to
determine why the performance of the algorithm on the CM-2
is so disappointing. We need to find out whether this
behavior is intrinsic to the algorithm/architecture
combination or is an artifact of our implementation.

We are testing a DAP code for a direct numerical
simulation of a two-dimensional, spatially evolving,
unstable, supersonic mixing layer on a 96 by 736 grid. We
are using the Abarbanel and Kumar [1] algorithm described
above. Because there are subsonic regions in the flow and
the flow is unsteady, the boundary conditions must be
capable of handling time-dependent inflow and outflow
conditions. We are now implementing characteristic
inflow/outflow boundary conditions; see Thompson [7] for
details.

We  hope  that  this  simulation  will  give  us  insight 
into  the  nonlinear  evolution  and  rollup  of  high 
Mach  number  mixing  layers.  This  simulation  is  a 
major  computational  task  and  will  require  very 
substantial  amounts  of  computing  time  on  our 
DAP-510.  It  is  hoped  that  we  can  upgrade  our 
machine  to  a  510C.  This  version,  just  announced, 
has  an  8  bit  coprocessor  for  each  bit  processor  in 
the  array.  It  is  estimated  that  this  will  increase 
the  floating  point  performance  of  the  DAP  by  a 
factor  of  5  to  10.  Such  an  increase  would  be  very 
welcome. 

We also intend to expand our study of parallel algorithms
for gas dynamics to include TVD and ENO schemes for the
Euler equations, and to extend the algorithms so as to
include general geometries using mapping techniques.
Finally, we plan to develop parallel algorithms for the
compressible Navier-Stokes equations suitable for
massively parallel computers.

Conclusions 

This study has shown that accurate and efficient finite
difference algorithms for the Euler equations can be
adapted to massively parallel computers. The overall
performance of these codes is somewhat less than, but
comparable to, that of vector codes for the same
algorithms on the Cray-2 and Cray-YMP. Further work to
study other algorithms, general geometries, and
inflow/outflow boundary conditions seems warranted.
Abstract

Vortex methods are a powerful tool for the numerical
simulation of incompressible flows at high Reynolds number.
They are based on a discrete representation of the vorticity
field and, in the inviscid limit, the computational elements,
or vortices, are simply advected at the local fluid velocity.
The numerical approximations transform the vorticity
equation, a non-linear PDE, into an N-body problem. The
$O(N^2)$ time complexity usually associated with these
problems has limited the number of computational elements to
a few thousand. This paper is concerned with the concurrent
implementation of fast vortex methods that reduce the time
complexity to $O(N \log N)$. The fast algorithm that is used
combines a binary tree data structure with high order
expansions for the induced velocity field. The implementation
of this particular algorithm on an MIMD architecture is
discussed.


Vortex Methods

Vortex methods (see Leonard [1]) are used to simulate
incompressible flows at high Reynolds number. The
two-dimensional inviscid vorticity equation,

$$ \frac{\partial \omega}{\partial t} + \mathbf{u} \cdot \nabla \omega = 0 , \tag{1} $$

is solved by discretizing the vorticity field into
Lagrangian vortex particles,

$$ \omega(z,t) = \sum_{j}^{N} \alpha_j \, \delta(z - z_j(t)) , \tag{2} $$

where $\alpha_j$ is the strength or the circulation of the
particle. For an incompressible flow, the knowledge of the
vorticity is sufficient to reconstruct the velocity field.
The discrete representation of the vorticity field can be
used to solve

$$ \nabla^2 \mathbf{u} = -\nabla \times (\omega \, \mathbf{e}_z) . \tag{3} $$

Using complex notation, the velocity induced by an isolated
vortex particle,

$$ \omega(z,t) = \alpha \, \delta(z - z_\alpha(t)) , \tag{4} $$

is

$$ w(z,t) = u(z,t) + i\,v(z,t) = \frac{i\alpha}{2\pi} \, \frac{1}{z^* - z_\alpha^*} , \tag{5} $$

where $z^*$ is the complex conjugate of z. Since Eq.(3) is
linear, superposition is used to determine that the velocity
field induced by

$$ \omega(z,t) = \sum_{j}^{N} \alpha_j \, \delta(z - z_j(t)) \tag{6} $$

is given by

$$ w(z,t) = \frac{i}{2\pi} \sum_{j}^{N} \frac{\alpha_j}{z^* - z_j^*} . \tag{7} $$

The velocity is evaluated at each particle location and
the discrete Lagrangian elements are simply advected at the
local fluid velocity. In this way, the numerical scheme
approximately satisfies the Kelvin and Helmholtz theorems
that govern the motion of vortex lines.
The numerical approximations have transformed the original
partial differential equation into a set of 2N ordinary
differential equations: an N-body problem. This class of
problems is encountered in many fields of computational
physics, e.g., molecular dynamics, gravitational
interactions, plasma physics and, of course, vortex
dynamics. It involves a summation over (N - 1) interactions
that has to be evaluated N times. Even if symmetry is used
to reduce the number of interactions by half, the resulting
$O(N^2)$ time complexity makes simulations using more than a
few thousand particles prohibitively expensive.
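
For reference, the direct summation whose cost is discussed above can
be sketched in Python as follows. The sketch assumes the standard
point-vortex kernel written in complex form, w = u + iv =
(i / 2 pi) * alpha_j / conj(z - z_j), with positive circulation taken
counterclockwise; the function name is hypothetical.

```python
import math


def direct_velocities(z, alpha):
    """Velocity w = u + i v induced at each vortex by all the others.

    z     : list of complex vortex positions
    alpha : list of real circulations
    The double loop visits every ordered pair, hence O(N^2) work.
    """
    n = len(z)
    w = [0j] * n
    for i in range(n):
        for j in range(n):
            if i != j:
                w[i] += 1j * alpha[j] / (2.0 * math.pi * (z[i] - z[j]).conjugate())
    return w
```

As a quick sanity check, two vortices of equal positive circulation
placed on the real axis induce opposite vertical velocities on each
other, so the pair rotates counterclockwise about its midpoint.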

Fast  Algorithms 

When each pairwise interaction is considered, distant and
nearby pairs of vortices are treated with the same care. As
a result, a disproportionate amount of time is spent
computing the influence of distant vortices that have little
influence on the velocity of a given particle. This is not
to say that the far-field is to be totally ignored, since
the accumulation of small contributions can have a
significant effect. The key element in making the velocity
evaluation faster is to approximate the influence of the
far-field by considering groups of vortices instead of the
individual vortices themselves. When the collective
influence of a distant group of vortices is to be evaluated,
the very accurate representation of the group provided by
its vortices can be overlooked and a cruder description that
retains only its most important features can be used. These
would be the group location, circulation, and possibly, some
coarse approximation of its shape and vorticity
distribution.

Far-field  approximations 

A convenient approximate representation is based on
multipole expansions. Consider a compact group of J point
vortices,

$$ \omega(z) = \sum_{j=1}^{J} \alpha_j \, \delta(z - z_j) , \tag{8} $$

where all vortices are located within a radius $r_M$ of the
group center, $z_M$. As discussed below, $z_M$ is chosen in
such a way as to make the group as compact as possible.
Other authors, like Appel [2], saw some benefits in locating
$z_M$ at the center of vorticity. In any event, the vortices
induce a velocity that can be expressed as

$$ w(z) = \frac{i}{2\pi} \sum_{j=1}^{J} \frac{\alpha_j}{z^* - z_j^*} . \tag{9} $$

Outside of the group, the velocity field can be rewritten
as a truncated multipole expansion,

(10)

which is valid for $|z - z_M| > r_M$. The coefficients $a_k$
are defined as

(11)

and in general, they are complex numbers. The contribution
from the first neglected term drops like

(12)

Therefore, even a truncated series will provide an accurate
velocity estimate far from $z_M$.

It would be possible to build a fast algorithm at this
stage by evaluating the multipole expansion at the location
of particles that don't belong to the group. This is
basically the scheme used by Barnes & Hut [3]. Greengard &
Rokhlin [4] went a step further by proposing group to group
interactions. In this case, the multipole expansion is
transformed into a Taylor series around the center of the
second group, $z_T$, where the influence of the first one is
sought. In the neighborhood of $z_T$, the induced velocity
can be written as

$$ w(z) = b_0 + b_1 (z - z_T) + b_2 (z - z_T)^2 + \cdots \tag{13} $$

where

(14)


An interaction between two groups consists of finding the
coefficients of the Taylor series from the knowledge of the
relative location of the groups and their respective
multipole expansions. The work associated with this
interaction is independent of the number of vortices in the
groups. Consequently, the speedup over the $O(N^2)$ approach
is more interesting when large groups are involved. On the
other hand, if the groups are small, it might be cheaper to
consider every pairwise interaction between vortices.
Assuming that the groups involved in the interaction have
the same number of vortices, J, the critical J for which
pairwise interactions of vortices require the same
computational effort as one group to group interaction will
be referred to as $J_{min}$. No group with fewer than
$J_{min}$ vortices will be allowed since it would slow down
the simulation.
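
To make the far-field idea concrete, the Python sketch below compares
a truncated multipole expansion of a small group with the direct sum
over its members. It is an illustration under stated assumptions, not
the authors' formulation: the coefficients are taken in the standard
form a_k = sum_j alpha_j (z_j - center)^k, the circulations are real,
and the velocity kernel is (i / 2 pi) * alpha_j / conj(z - z_j). All
function names are hypothetical.

```python
import math


def multipole_coefficients(z, alpha, center, terms):
    """a_k = sum_j alpha_j * (z_j - center)**k for k = 0 .. terms-1."""
    return [sum(a * (zj - center) ** k for zj, a in zip(z, alpha))
            for k in range(terms)]


def multipole_velocity(coeffs, center, point):
    """Evaluate the truncated expansion at a point outside the group.

    For real circulations, sum_j alpha_j / conj(point - z_j) is the
    complex conjugate of sum_k a_k / (point - center)**(k+1).
    """
    d = point - center
    s = sum(a_k / d ** (k + 1) for k, a_k in enumerate(coeffs))
    return 1j * s.conjugate() / (2.0 * math.pi)


def direct_velocity(z, alpha, point):
    """Velocity at point from every vortex of the group, one by one."""
    return sum(1j * a / (2.0 * math.pi * (point - zj).conjugate())
               for zj, a in zip(z, alpha))
```

Once the coefficients are known, the cost of evaluating the expansion
depends only on the number of terms kept and not on the number of
vortices in the group, which is the property exploited by the fast
algorithm.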

The threshold $J_{min}$ is a function of L, the number of
terms in the expansions. Since the work required to compute
one group to group interaction is of order $O(L^2)$, it
might seem preferable to keep L to a minimum, but then a
larger error would result from each approximation. The
error, $\epsilon$, is defined as the difference between the
velocities obtained from a given group to group
approximation and the ones resulting from all the pairwise
interactions of the groups' members. Greengard & Rokhlin
have shown that when the same number of terms is kept in
both expansions, $\epsilon$ is bounded by

(15)

where

(16)

and

$$ r_{max} = \max(r_M, r_T) , \tag{17} $$

$$ A = \sum_j |\alpha_j| . \tag{18} $$

These three quantities are known and an error estimate can
be found for any pair of groups. If this estimate is smaller
than an arbitrary criterion, the approximation is judged
acceptable and the computation can proceed with that group
to group interaction. If not, at least one group is too
large and the approximation is rejected since it would
result in a significant error. In that case, the larger
group is subdivided into two smaller ones and an error
estimate is found for the two new pairs of groups. If the
error is still too large, the procedure is repeated until a
valid approximation is found or until the smallest groups
are reached. In the latter situation, pairwise interactions
between vortices are used to determine the influence of one
group on another.

Figure 1: Fast algorithm's data structure

Data structure

One now needs a data structure that is going to facilitate
the search for acceptable approximations. As proposed by
Appel, a binary tree is used.
In that framework, a giant cluster sits on top of the data
structure; it includes all the vortex particles. It stores
all the information relevant to the group, i.e., its
location, its radius and the coefficients of the multipole
expansion. In addition, it carries the address of its two
children, each of them responsible for approximately half of
the vortices of the parent group. Whenever smaller groups
are sought, these pointers are used to rapidly access the
relevant information. The children carry the description of
their own group of vortices and are themselves pointing at
two smaller groups, their own children, the grand-children
of the patriarchal group. More subgroups are created by
equally dividing the vortices of the parent groups along the
"x" and "y" axes alternately. This splitting process stops
when all groups have approximately $J_{min}$ vortices. Then,
instead of pointing toward two smaller groups, the parent
node points toward a list of vortices. As shown in Fig.(1),
the data structure provides a quick way to access groups,
from the largest to the smallest ones, and ultimately to the
individual vortices themselves.
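
A binary tree of this kind can be sketched in a few lines of Python.
The class and function names below are hypothetical, the median split
by sorting is one simple way of dividing the vortices equally, and the
stopping rule (stop when a further split would leave fewer than j_min
vortices per group) is an assumption standing in for "approximately
J_min".

```python
class Group:
    """Node of the binary tree: two children, or a leaf list of vortices."""

    def __init__(self, indices, left=None, right=None):
        self.indices = indices  # indices of the vortices in this group
        self.left = left
        self.right = right


def build_tree(z, indices, j_min, axis=0):
    """Halve the group along x (axis 0) and y (axis 1) alternately."""
    if len(indices) < 2 * j_min:
        return Group(indices)  # leaf: points to a list of vortices
    if axis == 0:
        ordered = sorted(indices, key=lambda i: z[i].real)
    else:
        ordered = sorted(indices, key=lambda i: z[i].imag)
    half = len(ordered) // 2
    return Group(ordered,
                 build_tree(z, ordered[:half], j_min, 1 - axis),
                 build_tree(z, ordered[half:], j_min, 1 - axis))


def leaves(node):
    """Collect the smallest groups, from left to right."""
    if node.left is None:
        return [node]
    return leaves(node.left) + leaves(node.right)
```

In a full implementation each node would also store its center, radius
and expansion coefficients, as described in the text.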

Velocity  evaluations 

Once the groups have been identified and hierarchically
ordered, the coefficients of the multipole expansion that
will represent every one of them need to be evaluated.
Having access to the vortices belonging to every group,
Eq.(11) could be used for this purpose but it would be
costly, especially for the larger groups. This expression is
only used to find the coefficients of the smallest groups in
the data structure, the ones that have direct access to the
vortices. Then, the coefficients of the children are used to
find the multipole expansion of their parent group. The
expansions are constructed from the bottom up. The
coefficients of the left child adequately describe its
content with respect to the center of its group, $z_M$. To
represent the left half of the parent node, that multipole
expansion has to be shifted to the center of the parent
node, $z_M'$, and the new coefficients are:

(19)

The same operation is repeated for the right child and its
shifted coefficients are added to the ones of the left child
to form the multipole expansion of the parent group.
Recursive subroutines are used to repeat this assembling
process until the top of the binary tree is reached.
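
The bottom-up assembly can be checked numerically. The Python sketch
below assumes the standard coefficient definition a_k = sum_j alpha_j
(z_j - center)^k, for which the shift to a new center follows from the
binomial theorem: a'_k = sum over l from 0 to k of C(k, l) * a_l *
(old_center - new_center)^(k - l). The function names are
hypothetical, and the formula is the textbook form rather than a
transcription of Eq.(19).

```python
from math import comb


def coefficients(z, alpha, center, terms):
    """Multipole coefficients computed directly from the vortices."""
    return [sum(a * (zj - center) ** k for zj, a in zip(z, alpha))
            for k in range(terms)]


def shift_multipole(a, old_center, new_center):
    """Re-express the coefficients a_k about new_center."""
    d = old_center - new_center
    return [sum(comb(k, l) * a[l] * d ** (k - l) for l in range(k + 1))
            for k in range(len(a))]


def parent_coefficients(a_left, c_left, a_right, c_right, c_parent):
    """Shift both children to the parent center and add term by term."""
    left = shift_multipole(a_left, c_left, c_parent)
    right = shift_multipole(a_right, c_right, c_parent)
    return [x + y for x, y in zip(left, right)]
```

The cost of forming a parent expansion in this way depends only on the
number of terms, not on how many vortices the children contain.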


Once the data structure is ready, the velocity evaluations
can take place. The search for suitable pairs of groups is
done with the help of recursive subroutines, within() and
between(), similar to the ones used by Appel. The subroutine
between() finds the influence of one group on another while
within() computes the velocities within a group. It does so
by finding the interaction between its left and right
halves, after which the subroutine calls itself to compute
the interactions within each half. A within() of an
indivisible group is simply the interaction of all its
members.

When determining the mutual influence of two groups,
between() first checks the error estimate associated with
that pair of groups. If it is acceptable, the Taylor
coefficients of each group are immediately updated. When
this approximation is rejected, the larger group is split in
two parts and each half interacts with the group that was
not subdivided. The subroutine calls itself with smaller and
smaller groups until the error estimate is small enough or
until the groups cannot be subdivided anymore. In the latter
case, between() does not check the error estimate but
immediately proceeds with the pairwise interaction of the
vortices.

Either alternative concludes the interaction of the
two groups involved in the last call to between(),
which returns to the subroutine that called it. Before
the original call to between() returns, all the
between() subroutines called in the process must
return as well. To the user, it appears that all velocities
are computed by a single call to within(top); the
two subroutines then call themselves thousands of times
until all interactions have been accounted for.
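
The recursion described above can be sketched as follows. This is only an illustration under stated assumptions: the `Group` class, the `build` helper, the `acceptable` callback standing in for the error estimate, and the interaction log are all hypothetical names, and the actual velocity and Taylor coefficient updates are replaced by log entries.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Log = List[Tuple[str, List[int], List[int]]]


@dataclass
class Group:
    members: List[int]                  # indices of the vortices in the group
    left: Optional["Group"] = None
    right: Optional["Group"] = None

    def is_leaf(self) -> bool:
        return self.left is None


def build(members: List[int], leaf_size: int = 2) -> Group:
    """Hypothetical tree builder: halve the index list until groups are small."""
    if len(members) <= leaf_size:
        return Group(members)
    mid = len(members) // 2
    return Group(members, build(members[:mid], leaf_size),
                 build(members[mid:], leaf_size))


def between(a: Group, b: Group,
            acceptable: Callable[[Group, Group], bool], log: Log) -> None:
    """Mutual influence of two distinct groups."""
    if a.is_leaf() and b.is_leaf():
        # Neither group can be subdivided: pairwise interaction, no error check.
        log.append(("direct", a.members, b.members))
        return
    if acceptable(a, b):
        # Stand-in for updating the Taylor coefficients of both groups.
        log.append(("approx", a.members, b.members))
        return
    # Approximation rejected: split the larger group that can still be split.
    if b.is_leaf() or (not a.is_leaf() and len(a.members) >= len(b.members)):
        between(a.left, b, acceptable, log)
        between(a.right, b, acceptable, log)
    else:
        between(a, b.left, acceptable, log)
        between(a, b.right, acceptable, log)


def within(g: Group, acceptable: Callable[[Group, Group], bool], log: Log) -> None:
    """Interactions among the members of a single group."""
    if g.is_leaf():
        log.append(("self", g.members, g.members))
        return
    between(g.left, g.right, acceptable, log)
    within(g.left, acceptable, log)
    within(g.right, acceptable, log)


def covered_pairs(log: Log) -> int:
    """Number of vortex pairs accounted for by the logged interactions."""
    total = 0
    for kind, a, b in log:
        total += len(a) * (len(a) - 1) // 2 if kind == "self" else len(a) * len(b)
    return total
```

A single call such as `within(build(list(range(16))), acceptable, log)` accounts for every pair of vortices exactly once, whatever the acceptance test; the test only decides how many of the pairs are handled group to group.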

At  the  end  of  this  process,  some  of  the  velocities 
have  been  directly  assigned  to  the  individual  vortices 
but  most  of  the  information  about  the  velocity  field  lies 
in  the  Taylor  coefficients  of  the  groups.  Since  the  quan¬ 
tity  that  is  updated  is  the  location  of  the  particles,  the 
information  accumulated  in  these  coefficients  has  to  be 
transferred  downward  to  the  appropriate  vortices.  The 
Taylor  series  of  each  group  could  be  evaluated  at  all 
the  appropriate  locations  but  instead,  shifting  opera¬ 
tions  are  used  again.  This  procedure  is  similar  to  the 
one  that  took  place  to  find  the  multipole  coefficients 
with  the  distinction  that  it  proceeds  from  the  top  to 
the  bottom  of  the  data  structure.  The  Taylor  series  of 
the parent groups, centered around $z_p$, are
systematically shifted toward the centers of their child
groups, $z'$. The shifted coefficients are

$$b'_l = \sum_{k=l}^{L} \binom{k}{l}\, b_k \,(z' - z_p)^{k-l} \qquad (20)$$

and  are  simply  added  to  the  existing  ones.  After  they 
have  received  the  contribution  from  their  parent  node, 
the  updated  coefficients  are  shifted  downward  to  their 
own  children.  The  process  stops  when  the  bottom  of 
the  data  structure  is  reached;  the  Taylor  series  of  the 
smallest  groups  are  then  evaluated  at  each  particle  lo¬ 
cation.  Greengard  &  Rokhlin  have  shown  that  the  er¬ 
ror  estimate  of  Eq.(15)  is  not  affected  by  these  shifting 
operations. At this point, the velocity of each vortex
blob is known and an ODE solver is used to update
its  location.  New  multipole  expansions  are  built  from 
the  new  locations  and  the  next  velocity  evaluation  can 
take  place. 
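
The downward shift can be illustrated with a short sketch. It assumes the Taylor series of a group is an ordinary truncated power series with complex coefficients, f(z) = sum of b_k (z - center)^k, and uses the standard binomial re-expansion; the function names and the coefficient layout are assumptions made for this example, not the notation of the original code.

```python
from math import comb
from typing import List


def shift_taylor(coeffs: List[complex], z_parent: complex,
                 z_child: complex) -> List[complex]:
    """Re-expand a truncated Taylor series about z_child instead of z_parent."""
    d = z_child - z_parent
    n = len(coeffs)
    return [sum(comb(k, l) * coeffs[k] * d ** (k - l) for k in range(l, n))
            for l in range(n)]


def evaluate(coeffs: List[complex], center: complex, z: complex) -> complex:
    """Evaluate sum_k coeffs[k] * (z - center)**k."""
    return sum(b * (z - center) ** k for k, b in enumerate(coeffs))


def push_down(parent_coeffs: List[complex], z_parent: complex,
              child_coeffs: List[complex], z_child: complex) -> List[complex]:
    """Add the shifted parent series to the coefficients a child already holds."""
    shifted = shift_taylor(parent_coeffs, z_parent, z_child)
    return [c + s for c, s in zip(child_coeffs, shifted)]
```

Because the series is a finite polynomial, the shifted series takes the same value as the parent series at any point, which is consistent with the statement that the shifting operations do not affect the error estimate.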

Appel’s  data  structure  is  Lagrangian  since  it  is 
built  on  top  of  the  vortices  and  moves  with  them.  It 
can  be  used  for  many  time  steps,  but  eventually,  the 
groups  will  deform  and  could  even  begin  to  overlap. 
They  would  not  be  as  compact  as  the  original  groups 
and  the  fast  algorithm  performance  would  deteriorate. 
To  prevent  this,  the  original  data  structure  is  discarded 
every  few  time  steps  (10  is  a  typical  number)  and  new 
groups are identified from scratch by alternately
dividing the vortices along the “x” and “y” axes.
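
A minimal sketch of such a rebuild is given below. The text only states that the vortices are divided alternately along x and y; splitting at the median so that both halves hold about the same number of vortices, the `leaf_size` parameter, and the dictionary layout are assumptions made for this example.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def build_groups(points: Sequence[Tuple[float, float]],
                 indices: Optional[List[int]] = None,
                 axis: int = 0, leaf_size: int = 4) -> Dict:
    """Recursively bisect the vortices, alternating between x (0) and y (1)."""
    if indices is None:
        indices = list(range(len(points)))
    if len(indices) <= leaf_size:
        return {"members": list(indices), "left": None, "right": None}
    ordered = sorted(indices, key=lambda i: points[i][axis])
    mid = len(ordered) // 2            # assumed median split
    return {"members": ordered,
            "left": build_groups(points, ordered[:mid], 1 - axis, leaf_size),
            "right": build_groups(points, ordered[mid:], 1 - axis, leaf_size)}


def leaves(node: Dict) -> List[List[int]]:
    """Member lists of the indivisible groups, left to right."""
    if node["left"] is None:
        return [node["members"]]
    return leaves(node["left"]) + leaves(node["right"])
```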

The  data  structure  used  by  Greengard  &  Rokhlin 
is  based  on  a  spatial  decomposition  of  the  computa¬ 
tional  domain  and  consequently,  has  an  Eulerian  na¬ 
ture.  The  domain  is  subdivided  into  four  square  cells 
of  equal  area.  The  cells  that  contain  more  than  Jmin 
vortices  are  subdivided  again  and  so  forth.  As  the  vor¬ 
tices  move,  they  have  to  be  sorted  again  in  this  set  of 
rigid  boxes.  This  step  requires  little  work  but  com¬ 
plicates  a  parallel  implementation  as  vortices  have  to 
be  exchanged  between  processors  after  each  time  step. 
On  the  other  hand,  the  acceptable  pairs  of  groups  are 
known a priori when a rigid data structure is
used  and  a  parallel  implementation  can  benefit  from 
this  predictability  (see  Katzenelson  [4]). 

Fast  algorithm  performance 
To  evaluate  the  performance  of  the  fast  algorithm, 
velocities  are  computed  for  N  vortices  randomly  dis¬ 
tributed  over  a  1  X  1  square  computational  domain; 
their  circulation  is  also  assigned  randomly.  For  fast 
algorithms based on multi-range approximations, this
problem is actually a worst-case scenario. When the
vortex  blobs  are  spread  nearly  uniformly,  the  groups 


have to be created artificially and cannot be as
compact as the ones obtained in a problem where the
vortices are naturally clustered.

In any event, the velocities are first computed to
double precision accuracy with the O(N^2) method. This
is considered the exact solution and is used as a
reference against which the approximate velocities can
be compared. The combination of L and c is chosen
in  such  a  way  that  results  obtained  with  the  fast  al¬ 
gorithm  are  indistinguishable  from  a  single  precision 
accuracy  simulation.  This  is  a  very  severe  restric¬ 
tion  since  the  numerical  integration  of  these  velocities 
in  time  is  certainly  not  accurate  to  one  part  in  a  mil¬ 
lion.  However,  as  pointed  out  by  Barnes  &  Hut,  the 
error  due  to  the  group  to  group  approximations  could 
accumulate  over  many  time  steps  so  that  one  cannot 
allow  too  large  an  error  at  any  given  time  step.  In  the 
proposed  scheme,  the  same  data  structure  is  used  for 
many time steps and, as a result, the error vectors are
correlated  over  a  few  time  steps.  In  any  event,  it  is 
preferable  that  the  presence  of  the  fast  algorithm  be 
as  inconspicuous  as  possible. 


Figure 2  Performance of the fast algorithm (horizontal axis: log10(N))


Despite this severe requirement, Fig.(2) shows a
remarkable  speed-up  over  the  classical  approach.  The 
CPU  times  are  expressed  in  VAX  750  seconds.  The 
crossover  occurs  for  as  few  as  150  vortices;  at  this 
point,  the  extra  cost  of  maintaining  the  data  structure 
is  balanced  by  the  savings  associated  with  the  approx¬ 
imate  treatment  of  the  far  field.  When  N  is  increased 
further,  the  savings  outweigh  the  extra  bookkeeping 
and  the  proposed  algorithm  is  faster  than  its  com¬ 
petitor  by  a  margin  that  increases  with  the  number 
of vortices. While it is clear that the computational
requirement of the classical approach grows like the
square of the number of vortices, it is not as simple to
determine the growth rate for the fast algorithm.
Because it only involves group to vortex interactions, the
Barnes & Hut scheme can be shown to be O(N log N).
Group to group interactions, such as presented here,
remove some redundancy present in the Barnes & Hut
scheme but, at the same time, prevent an analysis based
on the behavior of individual particles. While allowing
these interactions, Greengard & Rokhlin used their
rigid data structure to put an upper bound on the
number of floating point operations. It was determined
that their algorithm is actually O(N). In the
proposed algorithm, the flexible data structure
prevents that systematic operation count and the time
complexity cannot be determined analytically but is
at most O(N log N). The two decades worth of data
shown on Fig.(2) are not enough to determine the time
complexity “experimentally”. From this, one can only
conclude that the difference between O(N log N) and
O(N) matters very little in practice. What is
really important is the constant multiplying the
leading order term.

The  use  of  recursive  subroutines  to  search  through 
the  binary  tree  for  acceptable  interactions  does  not 
lend  itself  to  vectorization.  However,  it  is  still  true  that 
the  interactions  are  independent  events.  The  influence 
of  A  on  B,  where  A  and  B  can  be  either  vortices  or  groups 
of  vortices,  can  be  determined  without  any  regard  to 
the  vorticity  field  that  surrounds  them.  That  inherent 
parallelism  can  be  exploited  to  implement  the  method 
on  concurrent  processors. 

Hypercube  implementation 

The fast algorithm discussed in the previous
sections was implemented on the Caltech-JPL Mark III
hypercube. This MIMD machine is a Motorola
68020-based multi-processor with 4 Megabytes of memory
per node. Up to 128 processors can be connected in
a hypercube topology.

The quality of the parallel implementation is
defined as

$$\epsilon = \frac{S}{P} \qquad (21)$$

the concurrent efficiency, where P is the number of
processors and S is the speed-up obtained over the
same application running on a single processor.


While load imbalance dominates the overhead for
the concurrent fast algorithm, it is not a problem for
the parallel O(N^2) method, which is known to be very
efficient (see Fox, Johnson et al. [5]). In that
framework, any pair of vortices represents the same amount
of  work  and  the  load  can  be  perfectly  balanced  by  as¬ 
signing  the  same  number  of  vortices  to  each  processor. 
Furthermore,  the  domain  decomposition  can  be  done 
without  paying  any  attention  to  the  location  of  the 
vortices.  To  find  the  velocities,  each  processor  makes 
a  copy  of  its  vortices  and  sends  it  to  half  of  the  other 
processors  where  it  interacts  with  the  resident  vortices. 
The  contributions  to  the  velocities  of  the  visiting  copy 
are  accumulated  as  it  is  sent  from  processor  to  pro¬ 
cessor.  Ultimately,  it  is  sent  back  to  its  original  pro¬ 
cessor  where  these  contributions  are  added  to  those  of 
the copy that stayed there. A large amount of data
has to be exchanged between processors but this
application is so compute intensive that the time spent
computing  velocities  dwarfs  the  communication  time 
and  efficiencies  close  to  unity  can  be  achieved  for  large 
problems.  The  regularity  of  the  problem  also  allows 
a  synchronous  implementation  which  further  reduces 
the  time  spent  communicating  between  the  nodes. 


Figure  3  Data  structure  assigned  to  processor  1 
The global nature of the O(N^2) approach has made its
parallel implementation fairly straightforward.
However, that character was drastically changed by the
fast algorithm as it introduced a strong component of
locality. Globality is still present since the influence of
a particle is felt throughout the domain, but more care
and computational effort are given to its near field. The
fast  parallel  algorithm  has  to  reflect  that  dual  nature, 
otherwise  an  efficient  implementation  will  never  be  ob¬ 
tained.  Moreover,  the  domain  decomposition  can  no 
longer  ignore  the  spatial  distribution  of  the  vortices. 
Nearby  vortices  are  strongly  coupled  computationally. 
Hence,  it  makes  sense  to  assign  them  to  the  same  pro¬ 
cessor.  The  binary  tree  data  structure  could  be  used 
for that purpose. By dismissing the (P - 1) largest
groups  in  the  data  structure,  P  groups  containing  ap¬ 
proximately  the  same  number  of  vortex  blobs  can  be 
identified  and  a  different  processor  is  assigned  the  re¬ 
sponsibility  of  each  of  these  subtrees.  For  example, 
Fig.(3)  shows  the  portion  of  the  data  structure  as¬ 
signed  to  processor  1  in  a  four  processor  environment. 

This  strategy  ensures  that  the  vortices  given  to 
each  processor  are  actually  neighbors  in  the  physical 
space.  The  drawback  of  this  approach  is  that  the  full 
data  structure  has  to  be  constructed  in  the  host  proces¬ 
sor  before  portions  of  it  can  be  sent  to  the  hypercube. 
In  practice,  binary  bisection  is  used  in  the  host  to  spa¬ 
tially  decompose  the  domain.  Then,  only  the  vortices 
are  sent  to  the  processors  where  a  binary  tree  is  locally 
built  on  top  of  them.  Less  data  has  to  be  loaded  on 
the  hypercube  and  the  generation  of  the  local  binary 
trees  can  be  done  in  parallel. 

In a fast algorithm context, sending a copy of the
local data structure to half the other processors does
not  necessarily  result  in  a  load  balanced  implementa¬ 
tion.  The  work  associated  with  processor  to  processor 
interactions  now  depends  on  their  respective  location 
in  physical  space.  Besides,  a  processor  whose  vortices 
are  located  at  the  center  of  the  domain  is  involved  in 
more  costly  interactions  than  a  peripheral  processor. 
To  achieve  the  best  possible  load  balancing,  that  cen¬ 
tral  processor  could  send  a  copy  of  its  data  to  more 
than  half  of  the  other  processors  and  hence,  be  itself 
responsible  for  a  smaller  fraction  of  the  work  associ¬ 
ated  with  its  vortices. 

Before  a  decision  is  made  on  which  one  is  going 
to  visit  and  which  one  is  going  to  receive,  the  number 
of  pairs  of  processors  that  need  to  exchange  their  data 
structure  needs  to  be  minimized.  Following  the  do¬ 
main  decomposition,  the  portion  of  the  data  structure 
that  sits  above  the  subtrees  is  not  present  anywhere 


in  the  hypercube.  That  gap  is  filled  using  recursive 
doubling  to  make  the  description  of  the  largest  group 
of every processor known to everybody else. By
limiting the broadcast to one group per processor, a small
amount  of  data  is  actually  exchanged  but,  as  seen  on 
Fig. (4),  this  step  gives  every  processor  a  coarse  de¬ 
scription  of  its  surroundings  and  helps  it  find  its  place 
in  the  universe. 
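
The recursive doubling step can be simulated in a few lines. This is a sketch under assumptions: the processors are modeled as entries of a list, the number of processors is a power of two, and each "group description" is an opaque item; the function name is hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def recursive_doubling(local_items: Sequence[T]) -> List[List[T]]:
    """Simulate an all-to-all broadcast in log2(P) pairwise exchange steps.

    local_items[r] is the description of the largest group held by processor r.
    The result lists, for every processor, the items it knows at the end.
    """
    num_procs = len(local_items)
    if num_procs & (num_procs - 1):
        raise ValueError("number of processors must be a power of two")
    known = [[item] for item in local_items]
    step = 1
    while step < num_procs:
        # Every processor exchanges all it knows with the partner whose rank
        # differs in one bit; the amount of known data doubles at each step.
        known = [known[rank] + known[rank ^ step] for rank in range(num_procs)]
        step *= 2
    return known
```

With one group per processor, the message size at the last step is only P/2 group descriptions, which is why the amount of data exchanged stays small.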


Figure  4  Data  structure  known  to  processor  1  after 
broadcast 

If  the  vortices  of  processor  A  are  far  enough  from 
those  of  processor  B,  it  is  even  possible  to  use  that 
coarse  description  to  compute  the  interaction  of  A  and 
B  without  an  additional  exchange  of  information.  The 
far field of every processor can be quickly disposed of.
After  thinking  globally,  one  now  has  to  act  locally;  if 
the  vortices  of  A  are  adjacent  to  those  of  B,  a  more 
detailed  description  of  their  vorticity  field  is  needed  to 
compute  their  mutual  influence. 

This  requires  a  transfer  of  information  from  ei¬ 
ther  A  to  B  or  from  B  to  A.  In  the  latter  case,  most  of 
the  work  involved  in  the  A-B  interaction  takes  place  in 
processor  A.  Obviously,  processor  B  should  not  always 
send its information away since it would then remain
idle  while  the  rest  of  the  hypercube  is  working.  Load 
balancing  concerns  will  dictate  the  flow  of  information. 
To do so, a list of all the interactions requiring a
further data exchange is drawn up in every processor. Since
the  upper  portion  of  the  tree  has  been  duplicated  P 
times, an identical copy of that list is created
simultaneously in every processor. Then the responsibility
for each item in the list is assigned to one of the two
processors involved while trying to distribute the resulting
computational  load  as  equally  as  possible. 

Since  vortices  move  only  slightly  during  each  time 
step,  the  computational  work  required  for  the  interac¬ 
tion  of  two  given  processors  at  the  previous  time  step 
can  be  used  as  an  estimate  of  the  work  involved  for 
the  present  one.  The  pairs  in  the  list  are  examined 
sequentially;  the  processor  with  the  lightest  work  load 
when  the  pair  is  considered  is  given  the  responsibility 
of  the  interaction  and  computes  the  interaction  after 
receiving  the  data  structure  from  the  other  one.  The 
work  load  that  is  used  to  make  that  decision  is  the  sum 
of  the  work  estimates  already  assigned  to  the  processor 
plus  half  of  the  estimates  of  the  interactions  in  which 
that processor is involved but that have yet to be assigned.
Ultimately, every processor knows not only where to
send  its  data  but  also  from  which  processor  it  should 
expect  to  receive  additional  information.  The  latter 
will  be  referred  to  as  the  request  list  of  a  processor. 
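
The assignment rule above can be sketched as a greedy pass over the list of pairs. The function and variable names are hypothetical, the work estimates are plain numbers, and the tie-breaking rule (the first processor of the pair wins) is an assumption, since the text does not specify one.

```python
from typing import List, Sequence, Tuple


def assign_interactions(pairs: Sequence[Tuple[int, int]],
                        estimates: Sequence[float],
                        num_procs: int) -> Tuple[List[List[int]], List[float]]:
    """Return (request_lists, assigned_work) for the given interaction pairs.

    request_lists[p] holds, in order, the processors whose data p must receive.
    """
    assigned = [0.0] * num_procs       # work already assigned to each processor
    pending = [0.0] * num_procs        # half of the estimates not yet assigned
    for (a, b), w in zip(pairs, estimates):
        pending[a] += 0.5 * w
        pending[b] += 0.5 * w
    request_lists: List[List[int]] = [[] for _ in range(num_procs)]
    for (a, b), w in zip(pairs, estimates):
        # The pair under consideration no longer counts as unassigned.
        pending[a] -= 0.5 * w
        pending[b] -= 0.5 * w
        load_a = assigned[a] + pending[a]
        load_b = assigned[b] + pending[b]
        owner, visitor = (a, b) if load_a <= load_b else (b, a)
        assigned[owner] += w
        request_lists[owner].append(visitor)
    return request_lists, assigned
```

For a central processor 0 that interacts with three peripheral processors at equal cost, `assign_interactions([(0, 1), (0, 2), (0, 3)], [1.0, 1.0, 1.0], 4)` hands two of the three interactions to the neighbors, so the central processor sends its data away more often than it receives.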

The  first  round  of  communication  can  now  take 
place.  To  ensure  that  processors  are  not  overloaded 
with  data,  information  is  sent  upon  request  only.  Each 
processor  first  checks  if  it  is  at  the  top  of  the  request 
list  of  any  other  processors.  If  so,  it  immediately  sends 
a  copy  of  its  data  structure  to  the  proper  recipient(s). 
Every  processor  receives  one  and  only  one  visiting  data 
structure.  As  soon  as  it  arrives,  this  structure  interacts 
with  the  local  groups  and  vortices.  Upon  completion  of 
that  operation,  the  processor  which  was  responsible  for 
the  interaction  sends  a  message  to  the  next  processor 
in  its  own  request  list  to  let  that  processor  know  that 
a  copy  of  its  data  is  now  needed  at  a  specific  location. 
The  updated  velocities  and  Taylor  series  coefficients 
of  the  visitor  are  also  sent  back  to  their  origin  where 
they  are  added  to  the  local  data  structure.  Processors 
frequently  peek  at  their  message  queue  to  make  sure 
that  requests  get  an  immediate  answer  and  that  the 
returning  information  is  absorbed  as  quickly  as  possi¬ 
ble.  This  keeps  the  message  queue  to  a  manageable 
size.  The  processor  that  has  just  sent  the  request  then 
has  to  wait  for  the  arrival  of  the  next  visitor. 

To  reduce  the  idle  time,  more  than  one  request 


can  be  filled  at  the  beginning  of  the  process  creating 
a  stack  of  visitors  in  every  processor.  Each  processor 
receives  two  or  three  visitors  and  gets  to  work  as  soon 
as  the  first  one  arrives;  the  other  ones  are  left  in  the 
stack.  When  the  first  interaction  is  completed  and  the 
next  request  sent,  a  processor  can  already  start  work¬ 
ing  on  the  next  visitor  in  its  stack.  Memory  restric¬ 
tions  limit  the  stack  size  to  two  or  three  visiting  data 
structures.  When  all  visiting  copies  have  returned  to 
their  origin,  the  processors  consider  the  interactions 
among their own vortices. Then, the vortex locations
and  the  whole  data  structure  are  updated.  The  process 
starts  over  again  by  broadcasting  the  largest  group  of 
each  processor. 

Obviously, this message sending takes place
asynchronously. Furthermore, the Mark III is considered as
a  collection  of  computers  loosely  connected  through  an 
arbitrary  network;  the  hypercube  topology  is  not  used 
as  such. 

At  this  point,  it  should  be  noted  that  the  data 
structure  used  in  a  parallel  implementation  differs  sig¬ 
nificantly  from  the  one  used  on  a  sequential  computer. 
In the latter case, the parent group points toward its
children  using  memory  addresses.  On  concurrent  com¬ 
puters,  the  local  binary  trees  are  exchanged  between 
processors  and  addresses  that  were  valid  where  the  tree 
was constructed are meaningless in a different
processor.

Instead,  the  data  structure  is  built  inside  a  one  di¬ 
mensional  array  and  parent  groups  refer  to  their  chil¬ 
dren  by  their  indices.  Two  arrays  are  actually  used, 
one  for  the  vortices,  V[  ],  and  one  for  the  groups, 
G[  ].  When  additional  information  is  requested,  G[  ] 
is sent immediately; then the respective location of the
processors' vortices is considered to determine if V[ ]
should  follow.  If  the  processors  are  adjacent,  the  full 
description  of  the  vorticity  field  is  needed  but  if  they 
are  sufficiently  far  away,  the  description  provided  by 
the  groups  is  adequate  and  V[  ]  can  stay  home. 
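
An index-based layout of this kind might look like the sketch below. The field names, the `leaf_size` parameter and the assumption that the vortices in V[ ] are already ordered so that every group owns a contiguous slice are choices made for this example only; the JSON round trip merely stands in for sending G[ ] to another processor.

```python
import json
from typing import Dict, List, Optional, Sequence, Tuple


def build_arrays(vortices: Sequence[Sequence[float]],
                 leaf_size: int = 2) -> Tuple[List[Dict[str, int]], List[List[float]]]:
    """Build the group array G and the vortex array V; children are indices into G."""
    V = [list(v) for v in vortices]
    G: List[Dict[str, int]] = []

    def add_group(first: int, count: int) -> int:
        index = len(G)
        G.append({"first": first, "count": count, "left": -1, "right": -1})
        if count > leaf_size:
            half = count // 2
            G[index]["left"] = add_group(first, half)
            G[index]["right"] = add_group(first + half, count - half)
        return index

    add_group(0, len(V))
    return G, V


def payload(G: List[Dict[str, int]], V: List[List[float]],
            adjacent: bool) -> Tuple[List[Dict[str, int]], Optional[List[List[float]]]]:
    """G is always sent; V follows only when the two processors are adjacent."""
    return (G, V) if adjacent else (G, None)


def leaf_vortices(G: List[Dict[str, int]], index: int = 0) -> List[int]:
    """Walk the tree through child indices and list the vortex indices in order."""
    g = G[index]
    if g["left"] == -1:
        return list(range(g["first"], g["first"] + g["count"]))
    return leaf_vortices(G, g["left"]) + leaf_vortices(G, g["right"])


def send_copy(G: List[Dict[str, int]]) -> List[Dict[str, int]]:
    """Stand-in for a message: the received copy is usable without any fix-up."""
    return json.loads(json.dumps(G))
```

Since the links are array indices rather than memory addresses, the copy returned by `send_copy` can be traversed by `leaf_vortices` exactly like the original.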

Efficiency  of  parallel  implementation 

Since  our  objective  is  to  compute  the  flow  around 
a  cylinder,  the  efficiency  of  the  parallel  implementation 
was  tested  on  such  a  problem.  The  region  for  which 
1 < r < 1.6 is uniformly covered with N particles. The
parallel  efficiency,  as  defined  in  Eq.(21),  is  shown  on 
Fig.(5) as a function of the hypercube size. The
parallel implementation is fairly robust, as the efficiency
ε remains larger than 0.7 for a 32-node concurrent
computer, meaning that a typical processor does useful
work at least 70% of the time. The number of vortices
per processor was kept roughly constant at 1500, even
though the parallel efficiency is not a strong function
of the problem size.

It  is,  however,  much  more  sensitive  to  the  quality 
of  the  domain  decomposition.  The  fast  parallel  algo¬ 
rithm  performs  better  when  all  the  sub-domains  have 
approximately  the  same  squarish  shape  or  in  other 
words,  when  the  largest  group  assigned  to  a  proces¬ 
sor  is  as  compact  as  possible. 


Figure  5  Parallel  efficiency  of  the  fast  algorithm. 


The  results  of  Fig.(5)  were  obtained  at  early  times 
when  the  Lagrangian  particles  are  still  distributed 
evenly  around  the  cylinder  which  makes  the  domain 
decomposition  an  easier  task.  At  later  times,  the  dis¬ 
tribution  of  the  vortices  does  not  allow  the  decompo¬ 
sition  of  the  domain  into  P  groups  having  approxi¬ 
mately  the  same  radius  and  the  same  number  of  vor¬ 
tices.  Some  subdomains  cover  a  larger  region  of  space 
and  as  a  result,  the  efficiency  drops  to  approximately 
0.6.  This  is  mainly  due  to  the  fact  that  more  proces¬ 
sors  end  up  in  the  near  field  of  a  processor  responsible 
for  a  large  group;  the  request  lists  are  longer  and  more 
data  has  to  be  moved  between  processors. 

The  sources  of  overhead  corresponding  to  Fig.(5) 
are  shown  on  Fig. (6)  normalized  with  the  useful  work. 
Load  imbalance,  the  largest  overhead  contributor,  is 
defined  as  the  difference  between  the  maximum  useful 
work  reported  by  a  processor  and  the  average  useful 
work  per  processor.  It  is  a  measure  of  how  much  faster 
the  simulation  would  have  been  if  the  load  had  been 
equally  divided  among  the  processors.  Secondly,  the 
extra  work  includes  the  time  spent  making  a  copy  of 
one’s  own  data  structure,  the  time  required  to  absorb 


the  returning  information  and  the  work  that  was  du¬ 
plicated  in  all  processors,  namely,  the  search  for  ac¬ 
ceptable  interactions  in  the  upper  portion  of  the  tree 
and  the  subsequent  creation  of  the  request  lists.  The 
remaining  overhead  has  been  lumped  under  communi¬ 
cation  time  although  most  of  it  is  probably  idle  time 
(or  synchronization  time)  that  was  not  included  in  the 
definition  of  load  imbalance. 
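
The definition of load imbalance used here is simple enough to state in code. The sketch assumes that the per-processor useful work is available as a list of timings and that "normalized with the useful work" means division by the average useful work per processor; both the names and that reading are assumptions.

```python
from typing import Sequence


def load_imbalance(useful_work: Sequence[float]) -> float:
    """Maximum useful work reported by a processor minus the average."""
    average = sum(useful_work) / len(useful_work)
    return max(useful_work) - average


def normalized_load_imbalance(useful_work: Sequence[float]) -> float:
    """Load imbalance divided by the average useful work (assumed normalization)."""
    average = sum(useful_work) / len(useful_work)
    return load_imbalance(useful_work) / average
```

For example, four processors reporting 10, 12, 8 and 10 seconds of useful work give an imbalance of 2 seconds, or 0.2 after normalization.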
Figure 6  Load imbalance (solid), communication &
synchronization  time  (dash)  and  extra  work 
(dot-dash)  as  a  function  of  the  number  of 
processors. 


It was originally expected that as P increases, the
near  field  of  a  processor  would  eventually  contain  a 
fixed  number  of  neighboring  processors.  The  length  of 
the  request  lists  and  the  load  imbalance  would  then 
reach  an  asymptote  and  the  loss  of  efficiency  would  be 
driven by the much smaller communication and extra
work times. However, this has yet to happen at 32
processors and the communication time is already starting
to  make  an  impact.  Nevertheless,  the  fast  algorithm, 
its  reasonably  efficient  parallel  implementation  and  the 
speed of the Mark III have made possible simulations
with  as  many  as  80,000  vortex  particles. 
Abstract

A  parallel  algorithm  for  the  solution  of  the  2D 
compressible  Navier-Stokes  equations  has  been 
developed and demonstrated on a distributed
memory  multicomputer.  The  algorithm  represents 
an  extension  of  an  earlier  parallel  incompressible 
pressure  correction  algorithm  developed  by  the 
author.  The  new  algorithm  features  a  revised  formu¬ 
lation  of  the  pressure  correction  equation  that 
simultaneously  updates  both  velocity  and  density  to 
enforce  continuity,  and  uses  upwinding  of  the  densi¬ 
ties  to  allow  shock  capturing.  The  parallel  imple¬ 
mentation  is  based  on  a  full  two-dimensional 
domain  decomposition.  As  in  the  earlier  algorithm, 
an  effective  block  correction  procedure  is  found  to 
be  the  key  to  high  parallel  efficiency.  Results  are 
obtained  on  a  32-node  Intel  iPSC/2VX  hypercube 
for  inviscid  and  viscous  transonic  flows  in  tur¬ 
bomachinery  blade  rows.  Performance  approaching 
1/4  that  of  a  single-processor  Cray  Y-MP  is 
achieved. 

Introduction 

Traditionally,  different  methods  have  been  used 
to  solve  compressible  and  incompressible  flows. 
Time-marching  methods  such  as  Jameson’s  explicit 
Runge-Kutta scheme [1] and the Beam-Warming
implicit  scheme  [2]  are  commonly  used  for  the  solu¬ 
tion  of  compressible  flows.  These  methods  treat  the 
continuity  equation  as  an  equation  for  the  density, 
and  then  extract  the  static  pressure  from  the  equa¬ 
tion  of  state.  Such  methods  fail  in  the  incompressi¬ 
ble  limit  of  zero  Mach  number  since  the  density 
becomes  independent  of  the  pressure,  and  the  pres¬ 
sure  cannot  be  calculated  from  the  density.  Pressure 
correction  methods,  in  which  the  pressure  is  solved 
for  directly,  rather  than  the  density,  have  proven 
highly  successful  for  incompressible  flows  [3].  The 
pressure  is  calculated  via  an  equation  for  pressure 
corrections  which  is  derived  by  algebraic  manipula¬ 
tions  of  the  discrete  momentum  and  continuity 
equations. 


In  recent  years,  the  pressure  correction  formula¬ 
tion  has  been  extended  to  handle  compressible  flow 
(see,  for  example  [4,5]).  The  resulting  algorithm  has 
the attractive property of being able to address
inviscid, laminar, and turbulent flows at all Mach
numbers, making it very widely applicable.

In  some  earlier  papers  [6,7],  the  present  author 
described  the  development  of  a  parallel  pressure 
correction  algorithm  for  laminar  and  turbulent 
incompressible  flows,  and  its  implementation  on  a 
distributed  memory  multicomputer.  The  parallel 
algorithm was based on a stripwise domain
decomposition, and the development of an effective
parallel block correction procedure which eliminated the
convergence  penalty  caused  by  the  domain  decom¬ 
position.  Speedups  in  excess  of  20  were  achieved 
with  32  scalar  processors  on  an  Intel  iPSC/2,  and 
performance  with  8  vector  processors  using  an  Intel 
iPSC/2VX approached 1/5th that of a single
processor Cray X-MP.

The  work  described  in  this  paper  represents  an 
extension  of  the  earlier  algorithm  to  a  parallel 
compressible  pressure  correction  algorithm  applica¬ 
ble  at  all  Mach  numbers.  The  original  stripwise 
domain  decomposition  has  been  replaced  with  a  full 
two-dimensional  decomposition,  to  allow  for  the  use 
of  more  processors.  The  paper  focuses  on  the 
highlights  of  the  compressible  formulation  and  the 
implementation  of  the  block  correction  procedure 
on a two-dimensional mesh of processors. Finally,
the  performance  of  the  algorithm  is  demonstrated 
for  two  test  calculations  involving  inviscid  and 
viscous  transonic  flow  in  a  turbomachinery  blade 
row. 

Highlights  of  the  Compressible  Formulation 

The  compressible  pressure  correction  algorithm 
developed here solves the two-dimensional
Navier-Stokes equations for viscous flow or the Euler
equations for inviscid flow. The equations for
conservation of x-momentum, y-momentum, and mass are
solved, along with the equation of state. For now,
the temperature has been calculated from the
assumption of constant stagnation enthalpy, rather
than by solving the compressible form of the energy
equation. For turbulent flows, the standard k-ε
turbulence model is used, along with the wall function
treatment  for  the  near-wall  regions  [8]. 

The  governing  equations  are  expressed  in  Carte¬ 
sian  coordinates,  and  then  transformed  to  a  general 
body-fitted coordinate system ξ=ξ(x,y), η=η(x,y). In
the interest of space, the reader is referred to
Reference [9] for complete details.

In the pressure correction approach, the
momentum equations are first solved with a guessed
pressure field $p^*$. The resulting velocity field does
not necessarily satisfy continuity. The pressure
correction equation is obtained by substituting
simplified forms of the momentum equations into
the continuity equation to obtain an equation for the
pressure correction $p'$, defined such that the
corrected pressure $p$ is given by

$$p = p^* + p' \qquad (1)$$

Once  the  pressure  corrections  are  obtained  from  the 
solution  of  this  equation,  the  pressure  is  updated 
and  the  velocities  are  corrected  to  satisfy  continuity. 

The  distinction  between  the  incompressible  and 
compressible  pressure  correction  algorithms  stems 
from  the  fact  that  the  density  is  taken  as  fixed  during 
the  course  of  the  pressure  correction  in  the 
incompressible  algorithm,  while  it  is  taken  to  be  a 
function  of  pressure  in  the  compressible  algorithm 
[5]. A density correction $\rho'$ is defined such that

$$\rho' = C^{\rho} p' \qquad (2)$$

The  mass  flux  terms  in  the  discrete  continuity  equa¬ 
tion  can  then  be  decomposed  into  four  parts,  i.e. 

$$\rho U = \rho^* U^* + \rho^* U' + \rho' U^* + \rho' U' \qquad (3)$$

The  first  two  terms,  which  are  the  only  terms 
present  in  the  incompressible  form,  represent  the 
mass  flux  calculated  from  the  given  density  and  velo¬ 
city  fields,  and  the  contributions  from  the  velocity 
corrections.  The  last  two  terms  are  the  contribu¬ 
tions  arising  from  compressibility,  representing  the 
linear  contribution  from  the  density  corrections,  and 
the  nonlinear  contribution  from  the  compressibility 
effect,  respectively.  In  contrast  to  earlier  works,  the 


nonlinear  term  is  retained,  since  it  helps  to  stabilize 
the procedure in the early iterations when neither $\rho'$
nor $U'$ is necessarily small. The nonlinear terms are
lagged  during  the  course  of  the  iterative  solution  of 
the  pressure  correction  equation  at  each  iteration  of 
the  overall  procedure. 
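
The four-part decomposition follows from a one-line expansion. As a worked step, and assuming only that the corrected density and velocity are each written as the current value (starred) plus a correction (primed), the mass flux is

$$\rho U = (\rho^* + \rho')(U^* + U') = \underbrace{\rho^* U^*}_{\text{given fields}} + \underbrace{\rho^* U'}_{\text{velocity correction}} + \underbrace{\rho' U^*}_{\text{linear in } \rho'} + \underbrace{\rho' U'}_{\text{nonlinear}}$$

Setting $\rho' = 0$ leaves only the first two terms, which is the incompressible form; the product $\rho' U'$ is the term that is lagged during the iterative solution.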

The  addition  of  the  compressible  terms  results  in 
a  nonlinear  convection-diffusion  equation  for  the 
pressure  corrections,  rather  than  the  linear  diffusion 
equation  obtained  in  the  incompressible  algorithm. 
Experience has shown that the compressible
pressure correction equation is more difficult to solve
than  its  incompressible  cousin,  and  that  close 
enforcement  of  continuity  at  each  iteration  is  even 
more  crucial  for  success  of  the  compressible  algo¬ 
rithm  than  in  the  incompressible  case. 

Upwinding of the densities provides the mechanism
for shock capturing in transonic and supersonic
flows. Although the first-order accurate hybrid
differencing scheme is still widely used in
incompressible flows, it leads to excessive smearing
of shocks and to excessive total pressure
errors. In this work, the hybrid scheme was replaced
by the conservative second-order accurate QUICK
[10] scheme. QUICK was found to capture shocks
within 3-4 grid cells and to lead to significantly better
total pressure conservation in inviscid flows.
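
For orientation, the sketch below shows the usual uniform-grid form of the QUICK face interpolation, which draws on two upstream cells and one downstream cell. The weights are the standard ones for a uniform grid; the transformed-grid form used in the present code is not given in the text, so this is only an illustration, and the function name is an assumption.

```python
def quick_face_value(phi_far_upstream, phi_upstream, phi_downstream):
    """QUICK interpolation to a cell face on a uniform grid.

    Weights are -1/8, 6/8 and 3/8 for the far-upstream, upstream and
    downstream cell-centre values respectively.
    """
    return (-phi_far_upstream + 6.0 * phi_upstream + 3.0 * phi_downstream) / 8.0


# Flow in +x; cell centres at x = -1.5, -0.5, 0.5 and the face at x = 0.
centres = [-1.5, -0.5, 0.5]
face_of_quadratic = quick_face_value(*[x * x for x in centres])
face_of_linear = quick_face_value(*centres)
```

Because the interpolation is quadratic, it reproduces linear and quadratic profiles exactly at the face, at the price of needing a second upstream value.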

Parallel  Implementation 

The basic parallel implementation of the pressure
correction algorithm using a stripwise decomposition
was described in detail in [6,7], and will not
be repeated here. In this work, a full two-dimensional
decomposition was used. A 2D decomposition
has the advantage that it allows the use of
more processors than a 1D decomposition, which is
limited to a number of processors no greater than
the number of cells in any one direction. The major
impact of using the 2D decomposition on the parallel
algorithm was on the implementation of the
block correction procedure, and on obtaining good
performance from the vector processors.

The two-dimensional domain decomposition is
done by subdividing the solution domain into overlapping
rectangular regions (in the transformed
space). Overlapping of the solution domains is
required so that each interior cell is computed as an
interior point by at least one processor. An overlap
of one cell in each direction at each face was used,
since this minimizes the redundant storage of quantities
in the overlapping regions. The use of a one
cell overlap in conjunction with the second-order
QUICK scheme requires some special treatment,
since each cell requires information from two cells
upstream when QUICK is used. Although the simplest
way to implement QUICK would be to use a
two cell overlap on each edge, this leads to very
inefficient memory utilization as the number of processors
gets large (for a problem of fixed size). The
approach adopted here was to utilize four temporary
vectors, one for each interface, to store the extra u
and v values for the cells adjacent to the interface.
Since QUICK is used only in the discretization of
the x- and y-momentum equations, the only penalty
is the cost of passing four additional messages for
each momentum equation and the storage required
by the four temporary vectors, both of which are
minimal.
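
The index bookkeeping for such a decomposition can be sketched as follows. This is a minimal illustration, assuming that each block is extended by one cell past every interior face; the function names and the even splitting rule are assumptions, not the paper's code.

```python
def block_ranges(n_cells, n_procs, overlap=1):
    """Return (start, stop) pairs, stop exclusive, one per processor.

    Each block owns a contiguous run of cells and is extended by
    `overlap` cells on every side that is not a domain boundary.
    """
    base, extra = divmod(n_cells, n_procs)
    ranges, start = [], 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)
        lo = max(start - overlap, 0)
        hi = min(start + size + overlap, n_cells)
        ranges.append((lo, hi))
        start += size
    return ranges


def decompose_2d(ni, nj, pi, pj, overlap=1):
    """Overlapping rectangular blocks for a pi x pj mesh of processors."""
    rows = block_ranges(ni, pi, overlap)
    cols = block_ranges(nj, pj, overlap)
    return [[(r, c) for c in cols] for r in rows]


# The 128 x 64 grid on an 8 x 4 mesh of processors used later in the paper.
blocks = decompose_2d(128, 64, 8, 4)
```

With a one cell overlap an interior processor stores 18 x 18 cells for its 16 x 16 block; a two cell overlap would raise this to 20 x 20, which is the memory penalty the temporary vectors avoid.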

A key finding in the earlier work [7] was that an
efficient parallel block correction procedure is
essential to maintaining high parallel efficiency as the
number of processors is increased. The block
corrections eliminate the reduction of convergence
rate of the elliptic pressure correction equation that
occurs when the solution is done by a parallel
Schwarz alternating method, rather than by an
implicit solution over the entire domain. In this
work, the same block correction procedure was
implemented, and again found to be crucial to
achieving convergence rates essentially independent
of the number of processors. The switch to a
two-dimensional decomposition from the earlier
stripwise decomposition necessitated significant changes
in its implementation, which are described in the
following paragraphs.

In the block correction procedure [11], a series
of one-dimensional corrections is made to the
solution of the pressure correction equation, first
over rows of the grid, and then over columns. The
following discussion will focus on the corrections on
the columns; the row corrections follow similarly.
The coefficients for the column corrections are
obtained by summing the original discretized
coefficients across each column. Since in a 2D
decomposition no processor spans an entire column
of grid cells, this summation step requires communication
among the processors in a given column.
Notice that in the original stripwise decomposition
reported earlier, the processors did span entire
columns, and this extra step was not needed. Once
the coefficients for each row are summed, they need
to be exchanged among all of the processors, so
that each processor can compute the corrections
independently. The resulting correction equations
are of tri-diagonal form, and can easily be solved using
the tri-diagonal matrix algorithm (TDMA). With the
stripwise decomposition, the exchange of
coefficients was done over all of the processors.
With the 2D decomposition, each processor needs
only to exchange coefficients with the other processors
that share the same row of grid cells.
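
Once the summed coefficients are available on a processor, the one-dimensional correction system is solved locally. The following is a minimal serial sketch of the TDMA (Thomas algorithm); the argument layout is an assumption made for illustration and the communication steps described above are not shown.

```python
def tdma(lower, diag, upper, rhs):
    """Solve a tri-diagonal system by forward elimination and back substitution.

    lower[i], diag[i], upper[i] multiply x[i-1], x[i], x[i+1] in row i;
    lower[0] and upper[-1] are ignored.
    """
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# A small diffusion-like system whose exact solution is all ones.
solution = tdma([0.0, -1.0, -1.0, -1.0],
                [2.0, 2.0, 2.0, 2.0],
                [-1.0, -1.0, -1.0, 0.0],
                [1.0, 0.0, 0.0, 1.0])
```

The cost is linear in the number of unknowns, so solving the correction system redundantly on every processor is cheap compared with the exchange of coefficients.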

The use of binary reflected Gray code (BRGC)
to map the processors onto a two-dimensional mesh
leads to the most efficient communication not only
for the local communication between neighboring
processors, but also for the rowwise and columnwise
exchanges of information required by the block
correction procedure. Figure 1 shows a BRGC mapping
for a 4 x 4 mesh of processors on a four-dimensional
hypercube. Note that not only are all of
the neighboring processors nearest neighbors (their
binary node numbers differ in only one digit), but
also that the numbers of all of the processors in a
given row share the same first two digits, and all of
the processors in a given column share the same last
two digits. This means that both the columnwise
and rowwise exchanges required by the block
correction can be performed via global shuffles and
global concatenation operations on subcubes of
dimension 2. This is not the case if the processors
are mapped onto the mesh in simple lexicographical
order.
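
The properties claimed for the Gray code mapping are easy to verify in a few lines. The sketch below builds one such mapping by concatenating the Gray codes of the row and column indices; the exact node numbering shown in Figure 1 may differ, and the function names are assumptions.

```python
def gray(i):
    """Binary reflected Gray code of the integer i."""
    return i ^ (i >> 1)


def brgc_mesh(row_bits, col_bits):
    """Node number for each (row, column) of a 2**row_bits x 2**col_bits mesh.

    The high bits hold the Gray code of the row index and the low bits
    the Gray code of the column index.
    """
    n_rows, n_cols = 1 << row_bits, 1 << col_bits
    return [[(gray(r) << col_bits) | gray(c) for c in range(n_cols)]
            for r in range(n_rows)]


mesh = brgc_mesh(2, 2)  # 4 x 4 mesh on a four-dimensional hypercube
for row in mesh:
    print(" ".join(format(node, "04b") for node in row))
```

Every row then occupies a two-dimensional subcube (fixed first two digits) and every column another (fixed last two digits), which is what allows the exchanges to stay within subcubes.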

Another important consideration in the use of a
2D decomposition on a multicomputer with vector
processors is the need to maintain sufficiently long
vector lengths for good vector performance. The
vector length required to reach one-half of the peak
performance (the so-called half length) is on the
order of 50 words on the iPSC/2 VX vector processor
[7]. If two-dimensional data structures are used
in the code (i.e. A is stored as A(I,J)) then the
resulting code will contain nested DO loops over I
and J. For such cases, only the innermost DO loop
will vectorize. For a 128 x 64 mesh on an 8 x 4 mesh
of processors, each processor treats a 16 x 16 problem.
The resulting vector length will be only 16, far
below the half length, and vector performance will
be poor. The use of one-dimensional data structures
(i.e. A is stored as A(IJ), where IJ = (I-1)*NJ+J)
results in a vector length of 256 for the same loop,
and near-peak vector performance will be achieved.
For this reason, the code developed here uses a
one-dimensional data structure throughout, and utilizes
a single loop over all of the grid cells whenever
possible.
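
The index arithmetic behind the one-dimensional data structure is sketched below in Python rather than Fortran, keeping the one-based convention IJ = (I-1)*NJ + J from the text. The array contents are arbitrary and the function name is an assumption.

```python
def flat_index(i, j, nj):
    """One-based (I, J) to one-based IJ, following IJ = (I-1)*NJ + J."""
    return (i - 1) * nj + j


ni, nj = 16, 16
a = [0.0] * (ni * nj)

# Nested form: the inner loop, the only one that vectorizes, has length NJ.
for i in range(1, ni + 1):
    for j in range(1, nj + 1):
        a[flat_index(i, j, nj) - 1] = 10.0 * i + j

# Flattened form: a single loop of length NI*NJ over the same storage.
b = [2.0 * value for value in a]
```

For the 16 x 16 local problem the flattened loop has length 256, well above the quoted half length of about 50 words, whereas the inner loop of the nested form has length 16.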

One final note regards the choice of the iterative
solver used to solve the linearized equations at each
stage of the algorithm. Two methods have been
implemented, namely line-by-line TDMA and a vectorized
point-symmetric Gauss-Seidel method [7].
The line-by-line TDMA is found to give faster convergence
due to its superior performance on the
compressible pressure correction equation. However,
the vectorized point solver leads to better vector
performance, resulting in a significantly lower
cost per iteration, at the price of somewhat slower
convergence. Since the optimum choice appears to
be problem dependent, both methods have been
retained.

Test  Results 

The parallel pressure correction algorithm
described here has been tested on a number of subsonic,
transonic, and supersonic flows in inlets, channels,
and blade row cascades. Inviscid, laminar, and
turbulent flows have been computed. Due to space
limitations, this section will focus on two calculations
for transonic flow in a hypothetical turbomachinery
cascade. The first case solves for the
inviscid flow through the cascade using the Euler
equations, while the second case solves for turbulent
flow in the same cascade using the Navier-Stokes
equations plus the k-ε turbulence model. The inlet
Mach number is 0.7, and the flow enters the cascade
at a 35° inlet angle. The computations were performed
on an H-grid with 128 x 64 grid cells; a
closeup view of the mesh in the vicinity of the leading
edge is shown in Figure 2. The mesh was constructed
from a 112 x 48 mesh by subdividing a
number of the original cells near the blade surfaces
and also by adding grid lines in the anticipated vicinity
of the shock.

The performance of the parallel algorithm was
explored by running 100 iterations of the algorithm
on both the scalar and vector processors of a 32-node
Intel iPSC/2 VX. The results for the cpu time
and parallel efficiency are shown in Figures 3 and 4.
With 8 vector processors, the code runs 1.5 times
faster with the line solver, and 3.2 times faster with
the point solver, than the code with the line solver
on 8 scalar processors. With 32 vector processors,
the corresponding values both fall to 1.4, due primarily
to a reduction in vector lengths as the
number of processors increases. The parallel
efficiency, estimated based on the procedure
described in [7], is found to be reasonably high.
With 32 processors, efficiencies of 80% are achieved
with scalar processors, 71% with the line solver on
vector processors, and 45% for the point solver on
vector processors. Again the shorter vector lengths
that occur when more processors are used are a
major factor in reducing the parallel efficiency for
the vector hypercube, particularly when the vectorized
point solver is used.


A highly vectorized version of the original serial
code was run on a single processor of a Cray Y-MP.
The cpu times for 100 iterations were 60.7 seconds
for the line-by-line TDMA, and 30.0 seconds for the
vectorized point solver. Hence, the performance of
the 32-node hypercube was about 1/4th that of the
Cray for the line solver, and about 1/5th for the
point solver.

In a transonic blade row of this type, the blockage
due to the thickness of the viscous boundary
layer can have a significant effect on the shock location
and on the resulting total pressure losses. Figure
5 shows the computed Mach number and pressure
contours for the inviscid test problem. Figure 6
shows the corresponding results for the turbulent
test problem. Reasonably converged solutions were
obtained in 1500 iterations in both cases. The presence
of a shock upstream of the trailing edge of the
blade is clearly evident. The turbulent solution
correctly predicts the influence of the boundary
layer blockage on the shock location, as the shock
clearly moves upstream from its inviscid location.
The interaction of the shock with the boundary layer
causes the boundary layer to separate, leading to a
large region of recirculation near the trailing edge in
the turbulent case.

An interesting result that is observed in the inviscid
solution regards the apparent glitches in the
Mach number contours. The location of these
glitches corresponds to the regions of the grid where
the cell spacing varies abruptly by a factor of two as
a result of the manner in which the final grid was
constructed by adding grid lines to the original grid.
The second-order QUICK scheme responds to the
abrupt change in spacing by wiggling. Notice that
the viscous solution shows no such behavior; evidently
there is enough physical dissipation in the
viscous equations to suppress the wiggles. The wiggles
in the inviscid case are evidence that the level of
numerical dissipation in the QUICK scheme is quite
low. Since any wakes observed in the inviscid solution
are also evidence of numerical dissipation, the
small size of the inviscid wake is further evidence of
the low level of dissipation in the QUICK scheme.
The lesson here is that inviscid test problems provide
a demanding challenge for any numerical formulation,
and therefore are very useful for exploring
the accuracy of the formulation, even if the ultimate
goal is to analyze viscous flows.

Concluding  Remarks 

A parallel compressible pressure correction
algorithm applicable at all Mach numbers has been
developed and demonstrated on a distributed
memory multicomputer. Reasonably high parallel
efficiencies have been achieved, and performance up
to 1/4th that of a single Cray Y-MP processor has
been obtained. Calculations of turbulent transonic
flow in a turbomachinery cascade correctly predict
the effect of the boundary layer blockage on the
shock position and the resulting losses. With the
advent of the next generation of multicomputers,
with faster processors (such as the Intel i860) and
faster interprocessor communication, this parallel
algorithm should achieve performance beyond that
of existing serial algorithms on conventional supercomputers
and allow larger and more accurate
simulations to be performed.

Fig. 1. Gray code mapping for 4 x 4 processor mesh

Fig. 2. Closeup of grid near blade leading edge

Fig. 3. Performance of parallel algorithm - 100 iterations, inviscid test case

Fig. 4. Parallel efficiency - 100 iterations, inviscid test case

Fig. 5. Inviscid transonic cascade test case
(a) Mach number contours
(b) Pressure contours

Fig. 6. Turbulent transonic cascade test case
(a) Mach number contours
(b) Pressure contours

Abstract

We  present  a  simulation  of  the 
electrosensory  input  of  the  weakly  electric 
fish  Apteronotus  leptorhynchus.  This  fish 
senses  its  environment  by  producing  a 
sinusoidal  voltage  difference  between  its 
body  and  tail  sections,  causing  an  electric 
field  and  a  current  distribution  in  the 
surrounding  water.  If  an  object  is  nearby 
which  has  different  electrical  conductivity 
from  the  surrounding  water,  the  current 
distribution  is  disturbed  on  the  skin  of  the 
fish.  The  fish  senses  this  difference  from  the 
usual  current  distribution,  and  infers  the 
presence  and  location  of  the  object. 

Mathematically,  the  problem  is  to  solve  a 
potential  equation  in  the  domain  exterior  to 
the  fish  with  Cauchy  boundary  conditions, 
in  the  presence  of  an  induced  dipole  arising 


from  the  object,  and  extract  the  potential 
difference  across  the  fish  skin. 

We  have  created  an  unstructured  triangular 
mesh  covering  the  two-dimensional 
manifold  of  the  fish  skin,  using  the 
Distributed  Irregular  Mesh  Environment 
(DIME),  then  used  the  Boundary  Element 
Method  to  solve  for  the  potential  derivative 
at  the  fish  skin. 

The  computational  problem  is  the  solution 
of  a  full  set  of  simultaneous  linear 
equations,  where  there  is  an  equation  for 
each  node  of  the  boundary  mesh,  typically 
about 100 to 200. We have used an NCUBE
hypercube to calculate the matrix elements
and solve these equations, once for each
relative position of the fish and the test
object. We present some early results from
the simulation.

1. Biological Background

All animals are faced with the computationally intense
task of continuously acquiring and analyzing sensory
data from their environment. To ensure maximally useful
data, animals appear to use a variety of motor strategies
or behaviors to optimally position their sensory
apparatus. In all higher animals, neural structures which
process both sensory and motor information are likely to
exist which can coordinate this exploratory behavior for
the sake of sensory acquisition. We believe the
cerebellum may be involved in this motor-sensory loop.

To  study  this  possibility,  we  have  chosen  the  weakly 
electric  fish,  which  use  a  unique  electrically  based  means 
of  exploring  their  environment*  •^.  These  nocturnal  fish, 
found in murky waters of the Congo and Amazon, have
developed  electrosensory  systems  to  allow  them  to 
detect  objects  without  relying  on  vision.  In  fact,  in  some 
species  this  electric  sense  appears  to  be  their  primary 
sensory  modality. 

This  sensory  system  relies  on  an  electric  organ  which 
generates  a  weak  electric  field  surrounding  the  fish’s 
body  that  in  turn  is  detected  by  specialized 
electroreceptor  cells  in  the  fish’s  skin.  The  presence  of 
animate  or  inanimate  objects  in  the  local  environment 
causes  distortions  of  this  electric  field,  which  are 
interpreted  by  the  fish.  In  some  species  of  weakly  electric 
fish,  the  electric  organ  fires  a  short  pulse  and  then  is 
silent,  in  effect  gating  the  electrosensory  information 
into  the  nervous  system  at  discrete  times  rather  than 
entering  as  a  continuous  stream  like  most  other  sensory 
modalities.  Other  species  sample  their  environment  with 
a pulse in the frequency domain, i.e. by generating a
nearly  sinusoidal  electrical  discharge.  The  simplicity  of 
the  sensory  signal,  in  addition  to  the  distributed  external 
representation  of  the  detecting  apparatus,  makes  the 
weakly  electric  fish  an  excellent  animal  with  which  to 
study  the  involvement  in  sensory  discrimination  of  the 
motor  system  in  general  and  body  position  in  particular. 

It  is  of  value  experimentally  and  also  interesting  to  note 
that  some  of  these  fish  have  the  largest  cerebellum, 
relative  to  their  brain  and  body  mass,  of  any  class  of 
animals.  The  experiments  we  have  undertaken  are 
specifically  aimed  at  understanding  to  what  extent  the 
exploratory  behavior  of  the  fish  involves  coordinated 
positioning  of  both  its  electric  organ  and  its 
electroreceptors  to  resolve  objects  in  its  local  electric 
field. 

Simulations  in  two  dimensions^''*  and  our  measurements 
with  actual  fish  have  shown  that  body  position, 
especially the tail angle, significantly alters the fields near
the  fish’s  skin.  We  are  currently  developing  freeze-frame 


video  techniques  to  be  used  in  combination  with  high 
resolution  electrode  arrays  positioned  in  the  fish  tank  to 
record  fish  behavior  in  response  to  a  variety  of 
environmental  stimuli. 

To  study  quantitatively  how  the  fish’s  behavior  affects 
the “electric images” of objects, we are developing three-dimensional
computer simulations of the electric fields
that  the  fish  generate  and  detect.  These  simulations, 
when  calibrated  with  the  measured  fields,  should  allow 
us  to  identify  and  focus  on  behaviors  that  are  most 
relevant  to  the  fish’s  sensory  acquisition  tasks,  and  to 
predict  the  electrical  consequences  of  the  behavior  of  the 
fish  with  higher  spatial  resolution  than  possible  in  the 
tank. 

Being  able  to  visualize  the  electric  fields,  in  false  color 
on  a  simulated  fish’s  body  as  it  swims,  may  provide  a 
new  level  of  intuition  into  how  these  curious  animals 
sense  and  respond  to  their  world. 

In  this  paper,  we  discuss  a  physical  model  of  an  electric 
fish,  then  the  equivalent  mathematical  problem,  which  is 
a  solution  of  Laplace’s  equation  in  the  region  exterior  to 
the  fish  and  the  object  it  is  sensing.  We  give  a  brief 
description  of  the  Boundary  Element  method  for  solving 
this  problem,  and  explain  why  this  method  is  well  suited 
for  a  distributed  parallel  architecture.  Finally  we  describe 
some  early  results  from  the  simulation. 

2.  Physical  Model 

We  need  to  reduce  the  great  complexity  of  a  biological 
organism  to  a  manageable  physical  model.  The 
ingredients  of  this  model  are  the  fish  body,  shown  in 
Figure 1, the object that the fish is sensing, and the water
exterior  to  both  the  fish  and  the  object. 

The  real  fish  has  some  projecting  fins,  and  our  first 
approximation  is  to  neglect  these  because  their  electrical 
properties are essentially the same as those of water.

Our second approximation is to simplify the time-dependence
of the electric field set up by the fish. The
time constant associated with electric field variations in
a dielectric medium is of order dielectric constant
divided by conductivity®. For water this characteristic
time is measured in fractions of a microsecond, and for a
perfectly conducting object is zero. The time between
pulses of the electric organ is about a millisecond in A.
leptorhynchus, so that if the fish is sensing a perfectly
conducting object, it is safe to ignore time variation and
model the fields as static. For some plant materials,
however, this time constant may be large, and the fish
may sense phase information (analogously to humans
using the phase difference between the ears to sense the
direction of a sound).

In this paper, we shall concentrate on the static
approximation.  There  is  thus  an  electric  field,  maintained 
by  the  fish,  which  causes  a  current  flow  proportional  to 
the  electric  field  according  to  Ohm’s  law. 
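
To see why the static approximation is reasonable in water, the relaxation time (permittivity divided by conductivity) can be compared with the roughly one millisecond period of the electric organ. The sketch below assumes a relative permittivity of 80 and a conductivity of 0.01 S/m for the tank water; both numbers are illustrative assumptions rather than values given in this paper.

```python
EPSILON_0 = 8.854e-12  # permittivity of free space, F/m


def relaxation_time(relative_permittivity, conductivity):
    """Relaxation time epsilon / sigma in seconds (SI units assumed)."""
    return relative_permittivity * EPSILON_0 / conductivity


# Assumed water properties, for illustration only.
tau_water = relaxation_time(80.0, 0.01)

# Ratio of the discharge period (about 1 ms) to the relaxation time.
period_over_tau = 1.0e-3 / tau_water
```

With these assumed values the relaxation time is a small fraction of a microsecond, several orders of magnitude shorter than the discharge period, which is consistent with treating the fields as static.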

We will assume that the fish is exploring a small
conductive object, such as a small metal sphere. First we
reduce the geometrical aspect of the object to being
pointlike, yet retaining some relevant electrical
properties. Except when the object is another electric
fish, we expect the object to have no active electrical
properties, but only to be an induced dipole, so that in the
presence of an electric field the object becomes a dipole
of strength proportional to the field and oriented opposite
to the field. The proportionality constant is the
polarizability of the object.

Thus the polarizability is the only parameter describing
the object. In this first paper, we shall not attempt to
calibrate experimental measurements and computed
results, but merely estimate this parameter. Polarizability
has the dimensions of volume, so we shall model an
object of polarizability 1 cm³, since this is the size of
object used in the experiments.

We  now  come  to  the  modelling  of  the  fish  body  itself. 
This  consists  of  a  skin  with  electroreceptor  cells  which 
can  detect  potential  difference,  and  a  rather  complex 
internal  structure.  We  shall  assume  that  the  source 
voltage  is  maintained  at  the  interface  between  the 
internal structure and the skin, so that we need not be
concerned with the details of the internal structure. Thus
the fish body is modelled as two parts: an internal part
with a given voltage distribution on its surface,
surrounded by a skin with variable conductivity.

Figure 1. Top, top view of the fish Apteronotus
leptorhynchus. Middle, side view of the fish. The fish is
about 20 cm long. Bottom, modelled voltage profile $\phi$
along the interior of the fish, from -100 mV at the tail
with a linear ramp to +25 mV at the head. The fins and
tail are not shown.

Figure 2: A section through the fish skin, with electrical
potential plotted vertically. The potential is assumed linear
within the skin.

Because  of  the  voltage  on  the  internal  body,  a  current 
distribution  is  set  up  in  the  fish  skin  and  water,  which 
have  different  conductivities.  The  signal  from  the 
electroreceptor cells in the skin is assumed to depend on
the  potential  difference  across  the  skin^ 

We  shall  simplify  the  model  a  little  more  by  assuming 
that  the  skin  thickness  is  small  compared  to  the  size  of 
the  fish.  This  is  not  equivalent  to  neglecting  the  skin 
altogether,  since  it  is  the  combination  of  skin  thickness 
and  conductivity  which  determines  its  electrical 
properties;  the  zero-skin-thickness  approximation 
merely  removes  geometrical  complexity  from  the  model 
in  exchange  for  a  slightly  more  complex  boundary 
condition  at  the  surface  of  the  fish  body,  as  discussed 
below. 

Figure 2 shows a section through the body of the fish,
with a graph of the voltage or potential superimposed.
We define $\phi$ to be the potential at the interface between
the fish skin and the internal part of the fish, and the
scalar field $\psi(x)$ to be the potential field in the water
exterior to the skin. The conductivities of the skin and
water are written $\sigma_s$ and $\sigma_w$ respectively.

We write the normal derivative of the exterior potential
as $\psi_n$, and conservation of current then implies that the
slope of the potential in the skin be $\sigma_w \psi_n / \sigma_s$. We shall
now assume that the potential varies linearly from the
inside to the outside of the skin; sufficient justification
for this would be that either the skin is thin compared to
the body thickness, or that the source potential varies
slowly over the skin compared to the skin thickness.

Using the thickness $t$ of the skin, we find the boundary
condition

$$\psi - \xi\,\psi_n = \phi \qquad (1)$$

where the effective skin thickness $\xi$ is defined to be

$$\xi = t\,\frac{\sigma_w}{\sigma_s}$$

This is a Cauchy or mixed boundary condition for the
exterior potential.
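
The role of the effective skin thickness can be illustrated numerically. The sketch below assumes that the effective thickness is the physical thickness multiplied by the ratio of water conductivity to skin conductivity, as current conservation across a thin skin implies, and that the normal points from the fish into the water. All numbers and function names are hypothetical.

```python
def effective_skin_thickness(thickness, sigma_skin, sigma_water):
    """Physical skin thickness rescaled by the conductivity ratio water/skin."""
    return thickness * sigma_water / sigma_skin


def skin_potential_difference(xi, psi_n):
    """Exterior potential minus interior potential, from psi - xi * psi_n = phi."""
    return xi * psi_n


# Hypothetical numbers: a 0.1 mm skin that is 100 times less conductive
# than the water, and an exterior normal derivative of 0.5 V/m.
xi = effective_skin_thickness(1.0e-4, sigma_skin=1.0e-4, sigma_water=1.0e-2)
drop = skin_potential_difference(xi, psi_n=0.5)
```

A thin but poorly conducting skin therefore behaves like a much thicker layer of water, which is why the zero-thickness limit does not amount to neglecting the skin.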

Conservation of charge is again the guiding physical law
to obtain the differential equation satisfied by $\psi$ in the
water. Mathematically, it means that the divergence of
the current density is zero; thus we can use Ohm's law to
write

$$\nabla \cdot (\sigma \nabla \psi) = 0$$

where $\sigma$ is the conductivity of the water, assumed
uniform, and $-\nabla\psi$ is the electric field. This equation
reduces to Laplace's equation.

3.  Mathematical  Theory 

The  Boundary  Element  method^’®  has  been  used  for 
many  applications  where  it  is  necessary  to  solve  a  linear 
elliptic  partial  differential  equation.  The  derivation  is 
particularly  simple  for  the  case  of  Laplace’s  equation, 
which  we  present  with  less  than  complete  mathematical 
rigor. 

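Green's theorem states that if functions $U$ and $V$ are free
of singularities in a domain $\Omega$, with the normal outward
from $\Omega$, then

$$\int_{\Omega} \left( U \nabla^2 V - V \nabla^2 U \right) d\Omega = \oint_{\partial\Omega} \left( U \frac{\partial V}{\partial n} - V \frac{\partial U}{\partial n} \right) dS$$

We define $V$ to be the desired solution $\psi$, and for some
fixed point $p$, we set

$$U(q) = \frac{1}{|p - q|}$$

Since $\nabla^2 \psi = 0$ and $\nabla^2 U = -4\pi\,\delta(p - q)$, Green's theorem
becomes

$$A^{\Omega}(p)\,\psi(p) = \oint_{\partial\Omega} \left[ \frac{1}{|p - q|} \frac{\partial \psi}{\partial n}(q) - \psi(q) \frac{\partial}{\partial n_q} \frac{1}{|p - q|} \right] dS_q$$

where $A^{\Omega}(p)$ is the solid angle around $p$ subtended by $\Omega$;
for example if $\Omega$ is a cube, then $A$ is $4\pi$ inside the cube,
$2\pi$ on a face, $\pi$ on an edge, and $\pi/2$ at a corner of the
cube.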
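
The solid angle values quoted for the cube can be checked numerically by integrating the flux of the inverse-square field over its six faces. The sketch below uses a simple midpoint rule on a unit cube; the function name, the unit cube and the resolution are assumptions made for illustration.

```python
import math


def solid_angle_of_cube(p, n=100):
    """Solid angle subtended at p by the unit cube [0, 1]^3.

    Midpoint quadrature of n_q . (q - p) / |q - p|^3 over the six faces,
    with n_q the outward normal at the surface point q.
    """
    h = 1.0 / n
    total = 0.0
    for axis in range(3):
        for side, normal in ((0.0, -1.0), (1.0, 1.0)):
            for a in range(n):
                for b in range(n):
                    q = [0.0, 0.0, 0.0]
                    q[axis] = side
                    q[(axis + 1) % 3] = (a + 0.5) * h
                    q[(axis + 2) % 3] = (b + 0.5) * h
                    r = [q[k] - p[k] for k in range(3)]
                    dist = math.sqrt(r[0] ** 2 + r[1] ** 2 + r[2] ** 2)
                    total += normal * r[axis] / dist ** 3 * h * h
    return total


inside = solid_angle_of_cube((0.5, 0.5, 0.5))    # expected close to 4*pi
on_face = solid_angle_of_cube((0.5, 0.5, 0.0))   # expected close to 2*pi
outside = solid_angle_of_cube((2.0, 0.5, 0.5))   # expected close to 0
```

The integrand is minus the normal derivative of 1/|p - q|, so the same quadrature also checks that a constant potential, whose normal derivative vanishes, satisfies the surface integral relation above.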

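We can simplify the notation by introducing linear
operators $A^{\Omega}$, $B^{\Omega}$ and $C^{\Omega}$, which can be defined by their
actions on a dummy function $u$:

$$(A^{\Omega} u)(p) = A^{\Omega}(p)\,u(p)$$

$$(B^{\Omega} u)(p) = \oint_{\partial\Omega} u(q)\,\frac{\partial}{\partial n_q} \frac{1}{|p - q|}\,dS_q$$

$$(C^{\Omega} u)(p) = -\oint_{\partial\Omega} \frac{u(q)}{|p - q|}\,dS_q$$

so that the Boundary Element Theorem for Laplace's
equation becomes

$$A^{\Omega}\psi + B^{\Omega}\psi + C^{\Omega}\psi_n = 0 \qquad (2)$$

Notice that if $\Omega' = \mathbb{R}^3 \setminus \Omega$, which is the region outside $\Omega$,
then $A^{\Omega} + A^{\Omega'} = 4\pi$, $B^{\Omega} + B^{\Omega'} = 0$, and $C^{\Omega} = C^{\Omega'}$.

When the function $\psi$ is approximated with Finite
Elements as discussed below, the operators $A$, $B$ and $C$
become matrices, with $A$ diagonal.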

The Boundary Element theorem (2) provides a relation
between $\psi$ and its normal derivative at any point on the
surface of $\Omega$, so that given another relation between the
two (the boundary condition (1)), we can solve for both.
We wish to solve for the normal derivative of the
potential, so we combine the Boundary Element theorem
and the boundary condition to obtain

$$(\xi A + \xi B + C)\,\psi_n = -(A + B)\,\phi \qquad (3)$$

Note that this result is only true if the domain $\Omega$ is free of
singularities.
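
Once the operators have become matrices, equation (3) is a dense linear system for the nodal values of the normal derivative. The sketch below assembles and solves such a system with a small Gaussian elimination routine; it assumes a single uniform effective skin thickness, and the 2 x 2 matrices are made-up numbers used only to exercise the algebra, not boundary element matrices.

```python
def solve_dense(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def normal_derivative(a_diag, b, c, xi, phi):
    """Solve (xi*A + xi*B + C) psi_n = -(A + B) phi, with A diagonal."""
    n = len(phi)
    lhs = [[xi * ((a_diag[i] if i == j else 0.0) + b[i][j]) + c[i][j]
            for j in range(n)] for i in range(n)]
    rhs = [-(a_diag[i] * phi[i] + sum(b[i][j] * phi[j] for j in range(n)))
           for i in range(n)]
    return solve_dense(lhs, rhs)


# Hypothetical matrices and source potential.
a_diag = [6.0, 6.0]
b = [[0.5, 0.1], [0.1, 0.5]]
c = [[-2.0, -0.3], [-0.3, -2.0]]
phi = [1.0, -1.0]
psi_n = normal_derivative(a_diag, b, c, xi=0.1, phi=phi)
```

Because the matrix is full, the work grows as the cube of the number of boundary nodes for a direct solve, and a new right-hand side is needed for each position of the test object.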

In  the  case  of  our  model  of  the  fish,  the  domain  of  interest 
is  that  outside  the  fish  and  the  object,  extending  to 
infinity.  We  have  solved  for  the  normal  derivative 
because  it  is  this  that  determines  the  potential  difference 
across  the  fish  skin,  which  in  turn  determines  the 
response  of  the  electroreceptor  cells. 

The solution of (3) yields the potential derivative for the
fish with no object in its environment. The solution for
the fish with object is obtained by introducing an induced
dipole. Let $\Psi$ be the potential in the presence of the
dipole. Without loss of generality, we may assume the
dipole to be at the origin, so that the vector strength $\mathbf{d}$ of
the dipole is (proportional to) the gradient of $\psi$ at the
origin. This gradient may be written as a surface integral
by differentiating the Boundary Element theorem:

$$\mathbf{d} = \nabla\psi(0) = \ldots \qquad (4)$$

We now separate out the singular part of $\Psi$, defining $\Psi_1$
by subtracting the dipole contribution:

$$\Psi_1 = \Psi - \frac{\mathbf{d}\cdot\mathbf{r}}{r^3}$$

Given that $\Psi_1$ satisfies the Boundary Element equation
(2), because it is free of singularities, and $\Psi$ satisfies the
boundary conditions, we may derive the equation
satisfied by $\Psi_n$:

$$(\xi A + \xi B + C)\,\Psi_n = \ldots \qquad (5)$$

4.  Computational  Method 

In order to discretize the boundary element method, we
have created a mesh of triangles covering the surface of
the fish, as shown in Figure 3, using the Distributed
Irregular Mesh Environment (DIME) [9], a portable
programming environment designed for calculations
with unstructured triangular meshes on distributed
memory parallel processors.

We discretize the field with linear Finite Elements:

$$\psi(x) = \sum_\nu \psi_\nu \, \chi_\nu(x)$$

where $\psi_\nu$ is the value of the field at the node $\nu$ and
$\chi_\nu(x)$ is the piecewise linear function which is unity at
the node $\nu$ and zero at every other node. The normal
derivative can be similarly discretized.
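
The piecewise linear basis described above can be illustrated on a single triangle, where the three nodal functions are the barycentric coordinates. The following is only a sketch under simplifying assumptions: the triangle is treated in its own plane with 2-D coordinates, and the function names `hat_functions` and `interpolate` are hypothetical, not part of DIME.

```python
def hat_functions(tri, p):
    """Values at point p of the three linear nodal functions of triangle tri.

    tri is a sequence of three (x, y) vertices; each returned value is 1 at
    its own vertex and 0 at the other two (barycentric coordinates).
    """
    (x1, y1), (x2, y2), (x3, y3) = tri
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    l1 = ((y2 - y3) * (p[0] - x3) + (x3 - x2) * (p[1] - y3)) / det
    l2 = ((y3 - y1) * (p[0] - x3) + (x1 - x3) * (p[1] - y3)) / det
    return (l1, l2, 1.0 - l1 - l2)


def interpolate(tri, nodal_values, p):
    """Linear finite-element interpolant: sum of nodal value times hat function."""
    return sum(w * v for w, v in zip(hat_functions(tri, p), nodal_values))


TRI = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
# Nodal values of the linear field f(x, y) = 2 + 3x - y at the three vertices.
VALUES = (2.0, 5.0, 1.0)
```

Because the interpolant is linear, `interpolate(TRI, VALUES, (0.25, 0.25))` reproduces the field value 2.5 at that point exactly (up to rounding).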

As observed above, the operators B and C become
matrices, and we define the matrix element $B_{\mu\nu}$ to be the
value of

$$(B\chi_\nu)(x_\mu)$$

which is the operator B applied to the nodal basis
function for node $\nu$ and evaluated at the position of node
$\mu$. Similarly for the operator C.

We can calculate these matrix elements either by
Gaussian integration [10] on the triangles neighboring node
$\nu$, or analytically. It is a useful check on the matrix
element calculation that as the number of Gauss points
increases, the result approaches the analytic result.


Figure 3: A typical mesh covering the surface of the fish,
containing IW nodes. The mesh is double-sheeted, for
the two sides of the fish.


To solve for the potential in the presence of the object,
the procedure is then as follows. First we solve for the
potential $\psi$ on the surface of the fish in the absence of the
dipole singularity using (3), then calculate the dipole
strength as the gradient of this potential at the position of
the object using (4). Now we solve for $\Psi$ with this dipole,
using equation (5). One way to visualize the result is to
display $\Psi_n - \psi_n$, which is proportional to the voltage
difference across the skin, and thus contains all the
electrosensory information regarding the object which is
accessible to the fish. Notice that it is the same matrix to
be solved for both of these calculations, with different
right hand sides. Thus it would be computationally
efficient to decompose the matrix and back-substitute for
each solve, rather than starting afresh each time.

For distributed memory parallel computation, we have a
packaged LU solver [11] for full matrices with partial
pivoting, and the solver is used in three stages as follows.
First  the  user  makes  an  initialization  call,  sending  the 
matrix  size;  then  an  LU  decomposition  call,  where  the 
user  passes  a  function  pointer  which  will  calculate  any 
required  matrix  element;  then  a  back-substitution  stage, 
where  the  user  passes  a  function  pointer  which  will 
calculate  an  element  of  the  right-hand-side  vector. 
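
The three-stage calling pattern can be sketched with a small serial stand-in. Everything here is an assumption made for illustration: the class name `DenseLUSolver`, its method names, and the callback signatures are hypothetical, and the real solver distributes the matrix across processors, which this sketch does not attempt. What it does show is the point made earlier: one decomposition can serve several right-hand sides.

```python
class DenseLUSolver:
    """Serial stand-in for a three-stage LU solver driven by callbacks."""

    def __init__(self, n):
        # Stage 1: initialization with the matrix size.
        self.n = n
        self.lu = None
        self.perm = None

    def decompose(self, element):
        # Stage 2: element(i, j) supplies any required matrix entry.
        n = self.n
        a = [[float(element(i, j)) for j in range(n)] for i in range(n)]
        perm = list(range(n))
        for k in range(n):
            # Partial pivoting: largest magnitude entry in column k.
            p = max(range(k, n), key=lambda r: abs(a[r][k]))
            if a[p][k] == 0.0:
                raise ValueError("matrix is singular")
            a[k], a[p] = a[p], a[k]
            perm[k], perm[p] = perm[p], perm[k]
            for i in range(k + 1, n):
                a[i][k] /= a[k][k]          # multiplier stored in place of L
                for j in range(k + 1, n):
                    a[i][j] -= a[i][k] * a[k][j]
        self.lu, self.perm = a, perm

    def back_substitute(self, rhs_element):
        # Stage 3: rhs_element(i) supplies entry i of the right-hand side.
        n, a = self.n, self.lu
        y = [float(rhs_element(self.perm[i])) for i in range(n)]
        for i in range(n):                  # forward pass with unit-lower L
            for j in range(i):
                y[i] -= a[i][j] * y[j]
        for i in reversed(range(n)):        # backward pass with U
            for j in range(i + 1, n):
                y[i] -= a[i][j] * y[j]
            y[i] /= a[i][i]
        return y


MATRIX = [[2.0, 1.0], [1.0, 3.0]]
solver = DenseLUSolver(2)
solver.decompose(lambda i, j: MATRIX[i][j])
first = solver.back_substitute(lambda i: [3.0, 5.0][i])
second = solver.back_substitute(lambda i: [1.0, 0.0][i])
```

Here `first` and `second` are obtained from a single call to `decompose`, mirroring the free-field and with-object solves that share one matrix.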

The  manipulation  of  the  mesh  is  done  redundantly  in 
each  processor,  so  that  before  the  solve  step  each 
processor  has  an  identical  copy  of  the  mesh,  and  is  thus 
capable  of  calculating  any  of  the  matrix  elements  or 
right-hand-side  elements.  When  the  solver  is  initialized, 
the  parallel  decomposition  of  the  matrix  is  dealt  with  by 
the solver, and it automatically balances the matrix
element computation and solving between the processors
without user input.

The  solution  vector  is  returned  in  a  distributed  form  to 
the  processors,  and  a  simple  combining  operation  across 
the  parallel  machine  gives  the  complete  solution  to  each 
processor. We may visualize the solution using a variety
of the tools from the DIME environment.

This  code  is  an  example  of  distributed  memory 
programming  at  its  easiest  and  most  efficient:  the 
difficult  part  of  the  programming  is  the  sequential  part, 
which  is  setting  up  and  manipulating  the  mesh  over  the 
fish  skin,  and  the  most  time-consuming  part  of  the 
computation  is  the  setting  up  and  solution  of  linear 
equations,  which  happens  without  any  effort  from  the 
user.  The  parallel  programming  has  been  done  in  writing 
the  matrix  solver:  when  more  such  tools  are  available, 
parallel  programming  will  become  much  easier. 

Let  us  compare  the  Boundary  Element  method  with  a 
more  conventional  finite  difference  approach  to  solving 
elliptic  problems. 

To  implement  the  finite  difference  method,  we  would 
first  make  a  mesh  filling  the  domain  of  the  problem,  that 
is  a  three  dimensional  mesh,  then  for  each  mesh  point  set 
up  a  linear  equation  relating  its  field  value  to  that  of  its 
neighbors.  We  would  then  need  to  solve  a  set  of  sparse 
linear  equations.  In  the  case  of  an  exterior  problem  such 
as  ours,  we  would  need  to  pay  special  attention  to  the  far- 
field,  making  sure  the  mesh  extends  out  far  enough  and 
that  the  proper  approximation  is  made  at  this  outer 
boundary. 


With  the  Boundary  Element  method,  we  discretize  only 
the  surface  of  the  domain,  and  again  solve  a  set  of  linear 
equations,  except  that  now  they  are  no  longer  sparse.  The 
far-field  is  no  longer  a  problem,  since  this  is  taken  care  of 
analytically. 

If  it  is  possible  to  make  a  regular  grid  surrounding  the 
domain  of  interest,  then  the  Finite  Difference  method  is 
probably  more  efficient,  since  multigrid  methods  or 
alternating  direction  methods  will  be  faster  than  the 
solution  of  a  full  matrix.  It  is  with  complex  geometries 
however,  that  the  Boundary  Element  method  can  be 
faster  and  more  efficient,  on  sequential  or  distributed 
memory  machines.  It  is  much  easier  to  produce  a  mesh 
covering  a  curved  two-dimensional  manifold  than  a 
three-dimensional  mesh  filling  the  space  exterior  to  the 
manifold. If the manifold is changing from step to step,
the  2D  mesh  need  only  be  distorted,  whereas  a  3D  mesh 
must  be  completely  remade,  or  at  least  strongly 
smoothed,  to  prevent  tangling.  If  the  3D  mesh  is  not 
regular,  the  user  faces  the  not  inconsiderable  challenge  of 
explicit  load  balancing  and  communication  at  the 
processor  boundaries. 

We  feel  that  the  existence  of  distributed  matrix  solving 
software  makes  the  Boundary  Element  method 
preferable  to  conventional  Finite  Difference  methods, 
since  it  is  competitive  in  computation  time,  and  much 
easier  to  program. 

5.  Results 

Figure  4  shows  four  fish  in  various  unlikely  positions. 

For this initial investigation we have chosen to set the
effective skin thickness $\xi$ to be 2 cm, after measurements
by Scheich and Bullock [12]. This figure has significant
error, and of course the real fish has variable $\xi$ over its
body.

Figure 5 shows a side view of the fish with the free field
$\psi$ (no object) shown in gray scale, and we can see how
the potential ramp at the skin-body interface has been
smoothed out by the resistivity of the skin. Figure 6
shows the computed potential contours for the midplane
around the fish body, showing the dipole field emanating
from the electric organ in the tail.

Figure 7 shows the difference field $\Psi_n - \psi_n$ for three
object positions, near the tail (left), at the center (middle)
and near the head of the fish (right). In each case the
object is 3 cm above the midplane, and the fish is 21 cm
long. It can be seen that the difference field, which is also
the sensory input for the fish, is greatest when the object
is close to the head. A better view of the difference
voltage is shown in Figure 8, which shows the values of
the difference voltage on the midline of the fish, for
various object positions. Again it may be seen that the
maximum sensory input occurs when the object is close
to the head of the fish, rather than the tail, from which the
dipole field emanates.

References

1. T. H. Bullock and W. Heiligenberg (eds.), Electroreception, Wiley,
New York, 1986.

2. H. W. Lissman, On the Function and Evolution of Electric Organs
in Fish, J. Exp. Biol., 35 (1958) 156.

3. W. Heiligenberg, Theoretical and Experimental Approaches to
Spatial Aspects of Electrolocation, J. Comp. Physiol., 103 (1975).

4. M. Bacher, A New Method for the Simulation of Electric Fields,
Generated by Electric Fish, and their Distortions by Objects,
Biol. Cybern., 47 (1983) 51.

5. J. D. Jackson, Classical Electrodynamics, Wiley, New York, 1975,
p. 296.

6. J. Bastian, Electrolocation, J. Comp. Physiol., 144 (1981).

7. T. A. Cruse and F. J. Rizzo (eds.), Boundary Integral Equation
Method: Computational Applications in Applied Mechanics, ASME
Proc. AMD-Vol. 11 (1975).

8. C. A. Brebbia et al. (eds.), Boundary Elements, Springer-Verlag,
Berlin, 1983.

9. R. D. Williams, DIME: A Users Manual, Caltech Concurrent
Computation Project Report C3P-861 (1990).

10. R. W. Cowper, Gaussian Quadrature Formulas for Triangles,
Int. J. Numer. Methods Eng., 7 (1973) 405.

11. P. G. Hipes, Comparison of LU and Gauss-Jordan System Solvers
for Distributed Memory Multiprocessors, Caltech Concurrent
Computation Project Report C3P-652c, to be published in
Concurrency: Practice and Experience.

12. H. Scheich and T. H. Bullock, The Detection of Electric Fields
from Electric Organs, in Electroreceptors and Other Specialized
Receptors in Lower Vertebrates (A. Fessard, ed.), Springer-Verlag,
Berlin, 1974.

Figure 4: Four fish with simple shading.




Figure 5: Potential distribution on the surface of the fish, with no
external object.

Figure 6: Potential contours on the midplane of the fish, showing
dipole distribution from the tail.

Figure 7: Gray-scale plots of voltage differences due to an object at
positions (left) near tail, (middle) at center and (right) near head.
Each object is 3 cm above mid-plane.

Figure 8: Envelope of voltage differences along midline of the fish,
for 20 object positions, each 3 cm above mid-plane.


Abstract

Two parallel algorithms for classical molecular dynamics are
presented. The first assigns each processor to a subset of particles;
the second assigns each to a fixed region of 3d space. The algorithms
are implemented on 1024-node hypercubes for problems characterized by
short-range forces, diffusion (so that each particle's neighbors
change in time), and problem size ranging from 250 to 10000 particles.
Timings for the algorithms on the 1024-node NCUBE/ten and the newer
NCUBE 2 hypercubes are given. The latter is found to be competitive
with a CRAY-XMP, running an optimized serial algorithm. For smaller
problems the NCUBE 2 and CRAY-XMP are roughly the same; for larger ones
the NCUBE 2 (with 1024 nodes) is up to twice as fast. Parallel
efficiencies of the algorithms and communication parameters for the
two hypercubes are also examined.


Introduction 

Molecular dynamics (MD) simulations are commonly used to calculate
static (thermodynamic) and dynamic (transport) properties of liquid
and solid state systems. Each of the N atoms (or molecules) is treated
as a point mass and Newton's equations of motion are then integrated to
move each atom forward in time. The physics of the model is encompassed
in the potential energy functional for the system from which
individual force equations for each atom can be derived.

We are interested in a general class of MD problem that has three
salient characteristics. The first is short-range forces, meaning that
each atom interacts only with other atoms that are less than a cutoff
distance $r_c$ away. Many solid and liquid materials are modeled this
way due to electronic screening effects. Hence the computation
required is only $O(N)$ instead of $O(N \log_2 N)$ as in the long-range
force case.

*This work was performed at Sandia National Laboratories which is
operated for the U.S. Department of Energy under contract number
DE-AC04-76DP00789.

The second characteristic is that atoms diffuse. Thus, each atom's
neighbors change as the simulation progresses. While the algorithms we
develop are relevant to the fixed lattice case (neighbors of an atom
remain the same throughout the simulation), it is a harder problem to
efficiently maintain a list of neighbors. Any liquid simulation and
most solid simulations where structure is changing require this.

The third characteristic is problem size. We consider problems ranging
from a few hundred atoms to several thousand. The vast majority of
work in the field is on systems of this size and many macroscopic
features can be accurately modeled by such systems [1,2]. A model is
typically designed with N as small as possible to capture the desired
macroscopic effects. The goal is then to perform each timestep as
quickly as possible since each step represents only $\sim 10^{-15}$
seconds of "real" time. In practice tens or hundreds of thousands of
timesteps are needed. Thus it is more interesting to be able to do
100,000 timesteps of a 1000 atom system than 1000 timesteps of a
100,000 atom system.

As has been extensively discussed, MD algorithms are inherently
parallel [3,4]. Previous work on hypercubes has demonstrated their
potential for MD, but has typically been done with relatively few
processors [5,6]. Our goal in this research was to implement the
fastest parallel algorithm possible for this class of problem to see
if it could perform as well as the best serial algorithm on a CRAY-XMP
vector supercomputer. This is a difficult task, since MD algorithms can
be vectorized and execute at tens of thousands of timesteps per hour
on a CRAY. As we shall see, achieving these speeds with current
generation hypercubes requires at least 512 processors.

In the next section the model problem is described. Then two serial
algorithms are discussed along with their corresponding parallel
implementations. Timings of the serial versions on a CRAY-XMP and of
the parallel algorithms on the hypercubes are given. Finally, some
comments are made with regard to comparing the two architectures and
conclusions drawn as to the fastest algorithms.


Model problem

The physical system modeled is a block of Al atoms, periodic in all 3
dimensions so as to simulate homogeneous bulk material. The block
sides are multiples of the Al lattice constant of 4.04 Å. Atoms can be
removed from the lattice to study point defects, or arranged
differently and given suitable boundary conditions to study planar
defects. We model it at a temperature slightly less than the melting
point of Al (930 K) at constant N, V (volume), and E (energy). The
interaction between atoms separated by a distance r is assumed to be
pairwise and given by the Morse potential

$$\phi(r) = \phi_0 \left[ e^{-2\alpha (r - r_0)} - 2 e^{-\alpha (r - r_0)} \right] \qquad (1)$$

with constants $\phi_0$, $\alpha$, and $r_0$ defined for Al. The
potential function is then cut and shifted so as to go to zero at a
distance $r = r_c$. The computational task at each time step is to
integrate the set of N coupled ODE's given by

$$m \frac{d^2 \mathbf{x}_i}{dt^2} = -\sum_j \frac{d\phi(r_{ij})}{dr} \, \frac{\mathbf{x}_i - \mathbf{x}_j}{r_{ij}} \qquad (2)$$

where the summation is over all atoms within a distance $r_c$ of atom
i. The initial conditions are specified by choosing a system energy
and corresponding initial velocities for the atoms. As the integration
proceeds, the system equilibrates and various effects can be studied
and calculated such as diffusion, melting, etc.

All of the algorithms discussed take advantage of several
computational tricks common to MD (see [7] for example). First and
foremost, the force (derivative of equation (1)) is tabulated for
10000 distances at the beginning of the simulation. When it is needed
in a force calculation, a value is simply linearly interpolated from
the table. This is a fast operation on the CRAY-XMP because of its
gather-scatter hardware and is a significant savings because energy
functionals more complicated than equation (1) (such as
pseudopotentials) need only be calculated once and hence are no more
costly to use. The hypercubes have ample memory for each processor to
store the full table. Additionally, square roots and excess flops are
avoided by calculating the forces in space and storing velocity and
force in units of distance. The simplest scheme for integrating
equation (2) is also used, a leapfrog algorithm, since high accuracy
in the integration is not a concern (due to the approximations
inherent in the potential function $\phi$).

Serial  Algorithms 

The basic kernel of computation required to integrate equation (2) is
as follows. At each timestep, each atom calculates its distance r to
each of its near neighbors. If $r < r_c$ then the force due to that
neighbor is calculated. This is done in turn for every atom and the
summed forces used to update velocities and positions. The key to
performing these calculations efficiently is to minimize the number of
neighbors that must be checked for possible interactions.

The first serial algorithm (S1) uses a neighbor list to accomplish
this. For each atom we create a list of all its neighbors within a
sphere of radius $r_s > r_c$ by calculating its distance to all the
other atoms. This list is used for a few timesteps to calculate all
pairwise interactions; then it is rebuilt before a neighbor could have
moved from a distance $r > r_s$ to $r < r_c$. Building the list
requires $O(N^2)$ operations, but is amortized over several timesteps.

For N > 2000 atoms it becomes more efficient to make the neighbor list
in the following way. The atoms are first sorted in one dimension (the
vertical). Each atom then only need examine neighbors in the sorted
list that are less than a vertical distance $r_s$ away. Hence the
entire update requires only $O(N \log_2 N)$ operations due to the
sort. The sorting algorithm that appears to work best for this problem
is a shell sort. It is faster than a heap or quick sort because the
atom list is only partially disordered from its previous state. We
note that it should be possible to construct a variant of the quick
sort which would work in $O(N \log_2 k)$ time where k is the maximum
distance an atom has moved in the list since the last sorting.

Algorithm S1 also takes advantage of Newton's 3rd law so that an atom
only need check half its neighbors. Hence a force is calculated once
for each pair of atoms, not for each atom in the pair.
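
A neighbor list of the kind used by S1 can be sketched as below. The sketch makes several assumptions not taken from the paper: it uses the simple all-pairs build (not the sorted variant), applies the minimum-image convention for the periodic box (valid when $r_s$ is less than half the box side), and stores each pair only once so that Newton's 3rd law can be applied. The function names are hypothetical.

```python
def dist2(a, b, box):
    """Squared minimum-image distance between points a and b in a periodic box."""
    total = 0.0
    for k in range(3):
        d = a[k] - b[k]
        d -= box[k] * round(d / box[k])  # wrap to the nearest periodic image
        total += d * d
    return total


def build_neighbor_list(pos, box, r_s):
    """All-pairs neighbor list; pair (i, j) is stored once, under i < j."""
    n = len(pos)
    rs2 = r_s * r_s
    neighbors = [[] for _ in range(n)]
    for i in range(n - 1):
        for j in range(i + 1, n):
            if dist2(pos[i], pos[j], box) < rs2:
                neighbors[i].append(j)
    return neighbors


BOX = (10.0, 10.0, 10.0)
ATOMS = [(0.5, 0.0, 0.0), (9.8, 0.0, 0.0), (5.0, 5.0, 5.0)]
NEIGHBORS = build_neighbor_list(ATOMS, BOX, 1.0)
```

In this small example atoms 0 and 1 are neighbors only through the periodic boundary, and the pair appears once, in the list of atom 0.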

The second serial algorithm (S2) is similar to S1 except in the way it
calculates the neighbor list. The atoms are binned into 3d boxes with
sides $s \ge r_s$ and the neighbor list for each atom constructed by
checking atoms in the neighboring 26 boxes (with Newton's 3rd law,
only 13 boxes). We note that S2 is a much faster technique than the
related algorithm which bins the atoms at every time step into boxes
of side $s \ge r_c$ and does not use a neighbor list. This is because
the cost of checking all atoms in neighboring boxes every timestep
outweighs the cost of periodically constructing a neighbor list from
the atoms in larger boxes.
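
The binning step of S2 can be sketched as follows. This is an illustrative version under assumptions: a periodic rectangular box, cells chosen as the largest grid whose side is at least $r_s$, and a dictionary of cells rather than whatever data layout the authors used. For simplicity it scans all 26 surrounding cells and keeps pairs with i < j, instead of visiting only 13 cells.

```python
from itertools import product


def dist2(a, b, box):
    """Squared minimum-image distance in a periodic box."""
    total = 0.0
    for k in range(3):
        d = a[k] - b[k]
        d -= box[k] * round(d / box[k])
        total += d * d
    return total


def bin_atoms(pos, box, r_s):
    """Assign atoms to cells whose side is at least r_s in every dimension."""
    ncell = [max(1, int(box[k] // r_s)) for k in range(3)]
    cells = {}
    for idx, p in enumerate(pos):
        key = tuple(int((p[k] % box[k]) / box[k] * ncell[k]) % ncell[k]
                    for k in range(3))
        cells.setdefault(key, []).append(idx)
    return cells, ncell


def cell_neighbor_list(pos, box, r_s):
    """Neighbor list built by checking only the surrounding cells."""
    cells, ncell = bin_atoms(pos, box, r_s)
    rs2 = r_s * r_s
    neighbors = [[] for _ in pos]
    for key, members in cells.items():
        # Distinct cells adjacent to this one (itself included), with wraparound.
        nearby = {tuple((key[k] + off[k]) % ncell[k] for k in range(3))
                  for off in product((-1, 0, 1), repeat=3)}
        for i in members:
            for cell in nearby:
                for j in cells.get(cell, []):
                    if j > i and dist2(pos[i], pos[j], box) < rs2:
                        neighbors[i].append(j)
    return neighbors


BOX = (10.0, 10.0, 10.0)
# Deterministic scattered test positions, chosen only for the example.
POS = [((3.7 * i) % 10.0, (5.3 * i) % 10.0, (7.1 * i) % 10.0) for i in range(60)]
CELL_LIST = cell_neighbor_list(POS, BOX, 2.5)
```

The result should match an all-pairs search with the same cutoff, while the work per atom depends only on the local density rather than on N.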

Parallel  Algorithms 

The first parallel algorithm (P1) is an adaptation of S1. Each
processor is assigned a set of $N/N_p$ atoms to update for the
duration of the simulation, where $N_p$ is the number of processors
(nodes). After every timestep each node broadcasts its updated atom
positions to every other node. This means each node receives the
current xyz positions of all N atoms, which it uses to do force
calculations for the next timestep. Similar to S1, each node builds a
neighbor list for its subset of atoms, so that it can efficiently
calculate the required forces. As in S1, the neighbor list is rebuilt
every few timesteps.

The communication portion of P1 performs the following task. Each node
has a small unique piece of a large vector. We want every node to end
up with a copy of the full vector. This can be done quickly using the
hypercube's connectivity. Each node first exchanges information with
an adjacent node in the vector; it now has a contiguous piece of the
vector twice as long. It then exchanges this piece with a node two
positions away; then with a node four away, etc. At the last step each
node exchanges half the vector with a node $N_p/2$ positions away.
Thus the global accumulation of the vector is done in d exchanges
(read/write pairs) where d is the dimension of the hypercube, each
exchange being done with a neighboring node (in the hypercube
topology). We note this method exchanges the same amount of
information as the circular ring scheme suggested for the long-range
force problem [3], but requires only d messages to be sent (and read),
instead of $N_p$. This offers a large savings on the hypercubes where
the cost of message start-up is significant. It does require each node
to have sufficient memory to store the entire position vectors, but
this is not a difficulty on the hypercubes for the problem sizes
considered here.
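
The exchange pattern just described can be simulated in a few lines. This is a serial sketch, not message-passing code: the nodes are entries of a Python list, the partner at step k is taken to be the node whose number differs in bit k (an assumption consistent with "two positions away, then four away"), and the function name is hypothetical.

```python
def hypercube_allgather(pieces):
    """Simulate the d-step exchange; pieces[i] is the piece owned by node i.

    Returns (result, d) where result[i] is the full vector of pieces as
    assembled on node i and d is the number of exchange steps.
    """
    n_nodes = len(pieces)
    d = n_nodes.bit_length() - 1
    if n_nodes < 1 or (1 << d) != n_nodes:
        raise ValueError("number of nodes must be a power of two")
    held = [{node: piece} for node, piece in enumerate(pieces)]
    for k in range(d):
        updated = []
        for node in range(n_nodes):
            partner = node ^ (1 << k)  # differs in exactly one bit
            merged = dict(held[node])
            merged.update(held[partner])
            updated.append(merged)
        held = updated
    return [[h[i] for i in range(n_nodes)] for h in held], d


PIECES = [[10 * i, 10 * i + 1] for i in range(8)]
RESULT, STEPS = hypercube_allgather(PIECES)
```

With 8 nodes every node holds all 8 pieces after 3 steps, and the amount held doubles at each step as the text describes.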

The computational work required in P1 for each node is not simply that
of S1 divided by $N_p$, because Newton's 3rd law is not implemented.
To do so would require the force vectors be globally exchanged at each
timestep similar to the position vectors. Since, as we shall see,
communication costs are roughly 50% of the total execution time for
this algorithm (on the full NCUBE/ten), it would not be effective to
halve the computation at the expense of doubling the amount of
communication required.

The second parallel algorithm (P2) takes more advantage of the local
nature of the force interactions, similar to S2. Each processor is
assigned a fixed region of space (a small box) and updates the
positions of all atoms within its box in a given timestep. We require
the box side to be $s \ge r_c$ so that each node need only receive
information from its neighboring 26 boxes. If $r_s > s \ge r_c$ then
each node must check all the atoms in neighboring boxes at every
timestep. If $s \ge r_s$ then it becomes more effective, as in S2, to
construct a neighbor list, use it for a few timesteps, and rebuild the
list periodically.

To ensure each box can get information from its 26 3d neighbors with
only nearest neighbor communication between nodes, the hypercube is
mapped into a 3d mesh, Gray coded in each dimension. The required
information can then be acquired by each node with only 6 exchanges.
First, each node passes its atom positions to its west neighbor, then
to its east. Next, it passes all of its accumulated information to the
north, then to the south. Finally, the entire list of atom positions
is sent to its upward neighbor, then to its downward. In addition,
each node must pass along special information for atoms that left its
box during the previous timestep. This includes velocities, various
flags, and when a neighbor list is being kept, the neighbor list for
the atom itself. Each node that receives the extra information checks
to see if the atom has moved into its box. Packing this information
into message buffers and reorganizing each node's list of current
atoms as atoms move between boxes is all extra overhead unneeded in
P1.
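
The Gray-coded mapping can be checked with a short sketch. It assumes the binary-reflected Gray code and a node number formed by concatenating the Gray codes of the three mesh coordinates; the paper does not spell out its exact bit layout, so the layout and the function names here are illustrative. The 16x8x8 mesh is used only as an example of a 1024-node arrangement.

```python
def gray(i):
    """Binary-reflected Gray code of i."""
    return i ^ (i >> 1)


def mesh_to_node(x, y, z, bits):
    """Hypercube node number for mesh position (x, y, z).

    bits = (bx, by, bz) gives the number of address bits per dimension, so
    the mesh has 2**bx by 2**by by 2**bz positions.
    """
    bx, by, bz = bits
    return (gray(x) << (by + bz)) | (gray(y) << bz) | gray(z)


def mesh_neighbors(x, y, z, bits):
    """The six periodic nearest neighbors of a mesh position."""
    nx, ny, nz = (1 << b for b in bits)
    return [((x - 1) % nx, y, z), ((x + 1) % nx, y, z),
            (x, (y - 1) % ny, z), (x, (y + 1) % ny, z),
            (x, y, (z - 1) % nz), (x, y, (z + 1) % nz)]


def is_hypercube_neighbor(a, b):
    """True when node numbers a and b differ in exactly one bit."""
    diff = a ^ b
    return diff != 0 and diff & (diff - 1) == 0


BITS = (4, 3, 3)  # a 16 x 8 x 8 mesh on a 10-dimensional hypercube
```

Because consecutive Gray codes differ in one bit, and the first and last codes of each dimension do as well, every west/east, north/south and up/down exchange is between directly connected hypercube nodes, including the periodic wraparound.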

The computation portion of P2 is similar to S2, except that again
Newton's 3rd law is not implemented, since it would require force
values to be exchanged. Also, although the fraction of time spent on
communication is not always as high as in P1, the efficiency of the
6-exchange mechanism described above would be partially lost if each
node only needed information from 13 neighbor boxes instead of 26.

Results 

Algorithms S1 and S2 were implemented on a single processor of a
CRAY-XMP with special attention given to ensuring the critical
routines (neighbor list formation, force calculation, the integration
step itself) vectorized. Algorithms P1 and P2 were implemented on both
the NCUBE/ten and the NCUBE 2 hypercubes. The NCUBE/ten at Sandia has
1024 nodes and the NCUBE 2 currently has 64 nodes; the latter will
soon be upgraded to 1024 nodes (maximum configuration is 8192 nodes).
Both P1 and P2 ran 3.3-3.5x faster (per node) on the NCUBE 2 than on
the NCUBE/ten for all problem sizes and numbers of nodes. As will be
discussed below, this behavior should hold up to 1024 nodes on the
NCUBE 2. Hence, the timings given for the NCUBE 2 (for more than 64
nodes) are NCUBE/ten times divided by 3.3.

The particular choice of parameters for which timings are given is
$r_c = 4.04$ Å (2nd nearest neighbor distance in Al) and $r_s = 5.5$ Å,
with recalculation of the neighbor list done every $T_n = 20$
timesteps. The choice of $r_s$ and $T_n$ is somewhat arbitrary and in
fact optimal choices depend on the temperature and other parameters
peculiar to a particular run of the simulation. For comparison
purposes with other CRAY timings [8] we chose not to implement an
automated recalculation procedure; instead we use these choices as
representative values. The system sizes studied were from N = 256
atoms (4x4x4 fcc lattice) to N = 10976 (14x14x14 fcc lattice).

The timings for the algorithms are listed in Table I and displayed in
Fig. 1. The data shows that for the CRAY, S2 is the fastest algorithm
except for the smallest problems. As the graph shows, the work it
requires is linear in N. Algorithm S1 has the $O(N \log_2 N)$ sorting
dependence that begins to slow it for larger N.

The P1 times are all for 1024 processors except for N = 256 and
N = 500 which can only use 256 nodes and N = 864 which uses 512. As
Fig. 1 shows, P1 times increase roughly linearly in N, despite the
$O(N^2)$ cost of creating the neighbor list. This is because while the
fraction of time spent on the neighbor list calculation increases from
3% to 54% (as N increases from 256 to 10976), the fraction spent on
communication actually decreases from 72% to 39%.

The P2 times reflect the fact that boxes cannot be smaller than the
potential cutoff $r_c$, so we are restricted to using a small number
of nodes for the smaller problems. The number of nodes used by P2 is
shown in the $N_p$ column. The "Box" entry is A if P2 used small boxes
($r_s > s \ge r_c$). A B entry indicates boxes of size $s \ge r_s$
could be used and hence a neighbor list was formed and taken advantage
of as discussed in the previous section. The changes in A, B, and
$N_p$ as N increases cause the kinks in the P2 curve. Nevertheless, P2
is the fastest of all the algorithms for problems large enough to use
most of the available processors.

Table I: CPU time (in seconds) for 100 timesteps
of the algorithms on various problem sizes. The
serial times S1, S2 are for the CRAY-XMP; the
parallel times P1, P2 are for the NCUBE 2 (inferred
from NCUBE/ten times). The Np and Box columns
refer to P2 and are explained in the text, as are the
two special lines at the bottom.

     N      S1      S2      P1      P2     Np    Box
   256    0.65    0.80    1.09    4.16     64    A
   500    1.50    1.65    1.64    10.4     64    A
   864    3.05    2.75    2.23    7.55     64    B
  1372    6.10    4.25    2.90    11.8     64    B
  2048    11.5    6.25    3.60    4.51    512    A
  2916    19.5    9.20    5.82    8.29    512    A
  4000    31.0    12.0    7.70    10.9    512    A
  5324    46.6    16.2    12.0    6.98    512    B
  6912    70.2    20.8    15.7    9.82    512    B
  8788     100    27.1    21.9    13.1    512    B
 10976     140    34.0    30.8    15.0    512    B
  4096    (16x8x8 lattice)        4.64   1024    A
 10648    (22x11x11 lattice)      8.52   1024    B

Fig. 1: Timings on the CRAY-XMP (S1, S2) and
NCUBE 2 (P1, P2) for all 4 algorithms. Data values
are from Table I. The isolated circles are the last two
table entries. The timings indicate the NCUBE 2
(with 1024 nodes) is roughly equal to the CRAY for
smaller problems and up to twice as fast on larger
ones.

The final two lines in Table I illustrate an important
point about P2. These are two problems with
dimensions somewhat artificially chosen so as to use
all 1024 processors. For example, the 22x11x11 lattice
(10648 atoms) is a near-optimal fit for 1024
boxes with side s = r_s. The fast simulation times for
these two cases show the NCUBE 2 can be considerably
faster than the CRAY if the physical problem
size can be tailored to match the power-of-two mesh
restrictions of the hypercube topology. While this
is the inverse of the way the experimenter typically
thinks of configuring a simulation, it is a useful trick
if applicable.
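As a small arithmetic check on the two special cases, the sketch below recomputes the atom counts from the lattice dimensions. The factor of four atoms per lattice cell is an inference from the numbers in Table I (4096 and 10648), not something stated in the text, and the function names are illustrative only.

```python
def atoms_in_lattice(nx, ny, nz, atoms_per_cell=4):
    """Atom count for an nx x ny x nz lattice.

    atoms_per_cell=4 is an assumption inferred from Table I.
    """
    return atoms_per_cell * nx * ny * nz


def atoms_per_node(n_atoms, n_nodes):
    """Average number of atoms owned by each node (or box)."""
    return n_atoms / n_nodes


if __name__ == "__main__":
    for dims in [(16, 8, 8), (22, 11, 11)]:
        n = atoms_in_lattice(*dims)
        print(dims, n, round(atoms_per_node(n, 1024), 2))
```

Both lattices give the N values listed in the last two rows of Table I when all 1024 nodes are used.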

Data  on  the  parallel  efficiencies  of  these  algo¬ 
rithms  provide  a  means  of  predicting  timings  for 
runs  with  larger  N  or  with  Np  ^  1024.  Typical 
timing  results  for  increasing  Np  are  given  in  Fig.  2 
for algorithm P1. The data show that the speed-up
is nearly linear until Np = 128. For larger Np,
the  time  spent  exchanging  atom  positions  (given  by 
the  dotted  lines)  becomes  a  significant  factor  and  in 
fact  eventually  sets  a  limit  on  the  speed  achievable 
by  the  algorithm. 

Actual  timings  from  Fig.  2  show  that  for  Np  = 
64  (when  there  are  N/Np  =  32  atoms  per  node), 
the  efficiency  is  83.3%  (speed-up  of  53.3).  This  was 
generally  true  for  all  problem  sizes  with  the  same 
N/Np  ratio.  Similarly,  the  efficiency  fell  to  ~  55% 
on  all  problem  sizes  with  N/Np  ~  10.  Algorithm  P2 
uses  only  local  communication,  and  so  its  efficiency 
is  also  constant  for  a  given  N/Np  ratio  (although 
in  practice,  the  mesh  restrictions  make  it  impossible 
to  hold  it  constant  as  N  increases).  We  found  for 
boxes  of  size  s  =  the  fraction  of  time  spent  on 
communication  was  15%  and  for  s  =  it  was  50%. 
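The relation between the efficiency and speed-up figures quoted above can be written out explicitly. The following is a minimal sketch using the usual definitions (speed-up equals efficiency times node count); the function names are illustrative and the one-node time in the last example is hypothetical.

```python
def speedup_from_efficiency(efficiency, nodes):
    """Speed-up implied by a parallel efficiency on a given node count."""
    return efficiency * nodes


def efficiency_from_times(t_one_node, t_parallel, nodes):
    """Parallel efficiency from a one-node time and an Np-node time."""
    return t_one_node / (nodes * t_parallel)


if __name__ == "__main__":
    # 83.3% efficiency on 64 nodes, as quoted for N/Np = 32
    print(round(speedup_from_efficiency(0.833, 64), 1))
    # Hypothetical timings: 64.0 s on one node, 1.2 s on 64 nodes
    print(round(efficiency_from_times(64.0, 1.2, 64), 3))
```

Because the efficiency was found to depend mainly on the N/Np ratio, the same calculation gives a rough way to predict timings for other problem sizes at a matching ratio.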

Fig. 2 also illustrates the 3.3x speed-up of the
NCUBE 2 vs. the NCUBE/ten on this code for up
to 64 nodes. We justify our extrapolation of this
factor to 1024 nodes in the following way. For small
numbers of nodes, computation is the dominating
factor, and the NCUBE 2 is 3.3x faster than the
NCUBE/ten (on this code) as the solid lines in Fig.
2 indicate. For larger numbers of nodes, communication
effects must be considered. Communication
(message passing) between neighboring nodes on the
hypercubes can be modeled by the equation

$$T = A_s + n A_t \qquad (3)$$

where T is the time for a message of n bytes to
be written or read, A_s is a start-up time, and A_t
is the per-byte time. The global vector accumulate
discussed above can be used to determine an
effective A_s and A_t where now every node in the
hypercube is communicating simultaneously and so
the derived A_s contains some synchronization and
loop overhead. Timing just the communication portion
of P1 gave A_s = 500 μs, A_t = 1.5 μs for the
NCUBE/ten and A_s = 200 μs, A_t = 0.4 μs for the
NCUBE 2. Hence, there is at least the same factor
of 3.3 speed-up in the A_t term, which dominates the
time required to pass the large messages used by P1
and P2. Since both the computation and communication
portions of these algorithms are 3.3x faster,
we expect the NCUBE 2 curves in Fig. 2 to follow
the NCUBE/ten curves out to 1024 nodes.
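The linear message-time model can be evaluated directly with the measured start-up and per-byte times quoted above. This is a sketch only: the dictionary layout and function names are illustrative, and the message sizes are arbitrary examples rather than sizes taken from the paper.

```python
MICROSECOND = 1e-6

# Effective start-up and per-byte times quoted in the text
NCUBE_TEN = {"startup": 500 * MICROSECOND, "per_byte": 1.5 * MICROSECOND}
NCUBE_2 = {"startup": 200 * MICROSECOND, "per_byte": 0.4 * MICROSECOND}


def message_time(n_bytes, machine):
    """Time in seconds to write or read a message of n_bytes."""
    return machine["startup"] + n_bytes * machine["per_byte"]


def speed_ratio(n_bytes):
    """How many times faster the NCUBE 2 passes an n_bytes message."""
    return message_time(n_bytes, NCUBE_TEN) / message_time(n_bytes, NCUBE_2)


if __name__ == "__main__":
    for n in (0, 1000, 10_000, 1_000_000):
        print(n, round(speed_ratio(n), 2))
```

In this model the ratio is 2.5 for empty messages (start-up only) and approaches 1.5/0.4 = 3.75 for very large ones, which is consistent with the statement that the per-byte term gives at least a factor of 3.3 for large messages.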


Fig. 2: Timings on the hypercubes for the N =
2048 problem (using algorithm P1) as a function of
nodes used (horizontal axis: hypercube dimension d,
with Np = 2^d). The dotted lines are the time spent in
communication, which becomes a dominating factor
when the full cube is used.


Caveats  and  Comments 

(1)  The  parallel  results  are  all  for  single  precision 
code.  The  CRAY  timings  are  for  double  precision 
(64  bits)  since  that  is  the  only  option.  MD  codes  do 
not  typically  require  double  precision  accuracy.  If  it 
were  needed,  the  hypercubes  run  in  double  precision 
(on  this  code)  about  a  factor  of  1.3  slower. 

(2) All the CRAY results are for one XMP processor.
Some of the same techniques used in P1 and
P2 could be used to adapt the serial algorithms for
multiple XMP processors.

(3) Our model problem implemented a constant
NVE ensemble. Another popular choice is to hold
N, P (pressure), and T (temperature) constant - the
NPT ensemble. This requires rescaling the box dimensions
and velocities at each timestep and would
require a small additional exchange of information
for algorithms P1 and P2. This communication
overhead would not be present in the serial versions.

(4)  More  sophisticated  multi-body  forces  or  ro¬ 
tational  torques  are  often  used  in  short-range  MD 
simulations.  This  increases  the  amount  of  computa¬ 
tion  needed  relative  to  communication.  Thus,  better 
parallel  performance  versus  the  CRAY  could  be  ex¬ 
pected  for  cases  more  computation  intensive  than 
our  simpler  pair  potential  example. 

(5)  Though  the  force  calculation  is  the  key  com¬ 
putational  kernel  in  the  MD  problem,  the  quantities 
of  interest  are  often  global  parameters  like  pressure, 
structure  factors,  diffusion  coefficients,  etc.  These 
are  usually  calculated  once  every  50  or  100  timesteps 
and  add  little  to  the  overall  time  required  for  the 
simulation  in  the  serial  case.  The  same  is  true  for 
the  parallel  case;  they  can  typically  be  calculated 
from  each  node’s  local  information  and  the  value 
accumulated  quickly  as  a  global  sum. 

Conclusions 

In summary, we have implemented two parallel
algorithms for a common MD problem, the short-range
force system. Algorithm P1 exchanges global
information in its communication portion, but uses
only local information for computation. It has the
advantage of simplicity and the ability to use more
processors on small problems. Thus it is a good
choice for small N. Algorithm P2 takes advantage
of locality for both the communication and computation,
but at the cost of significant overhead and
some difficulty in mapping the physical geometry to
the hypercube topology. When the 3d mesh fits well,
P2 is the faster choice, particularly as N increases
so that a large number of processors can be used.

On  the  NCUBE  2  with  1024  nodes  these  algo¬ 
rithms  should  be  faster  than  vectorized  CRAY-XMP 
algorithms  for  problems  with  more  than  500  atoms. 
However,  the  difference  in  speed  is  not  great  ex¬ 
cept  for  special  cases  where  the  physical  geometry 
maps  nicely  to  the  hypercube  mesh.  Nonetheless,  we 
believe  this  is  the  first  time  hypercubes  have  been 
shown  to  be  competitive  with  a  CRAY  for  this  class 
of  MD  problem. 

While we are confident our CRAY timings are
close to optimal for this problem [8], we do not
claim the same for the parallel case. We expect the
NCUBE 2 times to improve by a factor of two as
its compilers mature and memory wait states are
reduced. Furthermore, there are two issues touched
on in this work that merit further research. The first
is whether a sorting enhancement to algorithm P1
(as is implemented in S1) would increase its speed.

A second issue is whether a hybrid version of P1
and  P2  might  be  faster  for  some  problems.  For  ex¬ 
ample,  problems  that  can  only  use  512  3d  boxes 
might  use  2  nodes  per  physical  box  to  perform  the 
computational  part  without  losing  too  much  to  ad¬ 
ditional  communication.  It  is  also  not  clear  whether 
forcing  a  power-of-two  mesh  to  fit  the  problem  di¬ 
mensions  (preserving  nearest  neighbor  communica¬ 
tion)  is  always  best.  For  example,  a  10x10x10  mesh 
would  fit  some  problems  well  and  could  be  run  on 
1024  nodes,  with  24  nodes  idle.  The  issue  is  how  to 
embed  such  a  mesh  into  the  hypercube  with  a  mini¬ 
mum  communication  penalty.  The  answer  might  be 
different  on  the  NCUBE/ten  and  NCUBE  2  since 
the  latter  has  cut-through  routing  for  non-nearest 
neighbor  communication.  We  plan  to  pursue  these 
issues  in  the  future. 
Abstract

A model for the infra-red emission from the circumstellar
disc of a Be star is presented. The structure
and other physical parameters of this disc can be adjusted
to investigate the infra-red and optical line emission
from such an envelope. The model presently under
investigation is based on the early work of Drake and
has been computed on a μVAX II and a Meiko Computing
Surface. The parallel implementation of this model
allows a more complex and realistic structure to be modelled
in a reasonable timescale. Both an algorithmic and
an event decomposition of the code have been investigated
and the two methods are compared. The model has been
applied to several Be stars with good agreement with
observational data.

1  Introduction 

The spectra of Be stars are characterized by the presence
of emission lines of hydrogen and, typically, infra-red
excesses. Recent studies have established a correlation
between the line and infra-red intensities, and
have attempted to explain this relationship in terms of
models of the emission from an ionized circumstellar envelope
or disc [1,2]. The optical lines and infra-red continuum
are generally attributed to recombination and
free-free emission, respectively. These are treated
self-consistently by Kastner and Mazzali [2] in that the same
emitting region is responsible for both spectral features.
The observed shape and intensity of the infra-red continuum
place constraints on the density and structure
of the ionized region [3], which in turn determine the
emission line intensity. The predicted line strength may
then be compared with observation in order to test the
reliability of the model.

Here we demonstrate the improvements we are
making to the approach of Kastner and Mazzali [2]
through the use of the enhanced power of parallel processing.
The modeling technique, discussed in Section
3, consists of combining separate computations of the
line and continuum spectra of ionized slabs, each with
constant density, temperature and thickness, into an
ensemble. The resulting model can therefore, in principle,
describe a circumstellar disc with any desired density,
temperature and/or geometric structure. In Sections
4 and 5 we describe how this modeling procedure
is well suited to parallel processing, which reduces the
computation time to a practical level. We also present
preliminary fits to optical and infra-red observed Be
star spectra obtained contemporaneously in both wavebands.

2  Transputer  Systems 

The transputer is the computer on a chip (processor,
memory and communications) built by INMOS Ltd.
It implements the communicating sequential processes
(CSP) model [4] of computation, of which its native language,
OCCAM, is an implementation. Memory is local
so the memory bandwidth grows in proportion to the
number of transputers (unlike shared memory multiprocessor
machines). Each transputer also has an external
memory interface which extends the address space into
off-chip memory, although access to this is slower than
for the on-chip memory.

Transputers use point to point communication
links to communicate with other transputers. Each
transputer has four of these links, each of which corresponds
to two OCCAM communication channels, one in each
direction. These links are synchronous, bi-directional
bit serial links which can sustain a data rate of up
to 20 Mbits/sec. They can usually be switched (either
manually or electronically) so as to permit any network
(subject to the restriction of four links per transputer)
and as they are only point to point links, the communications
bandwidth does not saturate as more transputers
are used (unlike with single or multiple shared
bus systems).

The T800 transputer is a 32 bit, 10 MIPS processor
with 4 KByte of on-chip memory and a 64 bit floating-point
co-processor (which can operate concurrently with
the central processor), and is capable of sustaining 1.5
Mflops. It is based on a RISC (reduced instruction set)
architecture and requires little external support logic,
making it an ideal programmable building block for a
concurrent system.

All the work discussed in this paper was developed
on a Meiko M10 Computing Surface containing
T800 transputers (the transputer array) and a T414
(the local host) which communicates via a DR11 interface
with a μVAX II computer (the host machine).

The  Meiko  compiler  generates  object  code  from 
standard  FORTRAN  code  which  can  be  linked  into  a 
process  that  is  referenced  from  OCCAM  as  a  library 
routine,  communicating  with  its  environment  using  chan¬ 
nels.  System  channels  can  be  used  by  the  standard 
FORTRAN  READ  and  WRITE  statements  to  commu¬ 
nicate  with  the  filing  system  and  terminal.  The  proto¬ 
col  of  these  channels  is  the  same  as  that  used  by  the 
OCCAM  library  routines  for  accessing  the  filing  sys¬ 
tem  and  terminal,  allowing  the  interconnection  of  both 
OCCAM  and  FORTRAN  processes. 

There are also user channels, which can be referenced
from FORTRAN code via Meiko supplied routines.
These can be used to communicate with other
parallel processes. In this way FORTRAN code can be
viewed as a communicating process in the same way as
an OCCAM process, using the same channel protocols.
Communication via these user channels is much faster
than communication via the system channels due to the
simpler protocol and the absence of formatting of the
data.

3  A Constant Thickness Disc Model

Our computational method begins with the FORTRAN
coded model developed by Drake [5,6] which uses an
escape probability approach to calculate the line and
continuum emission from a static hydrogen slab. The
model takes into account the change in the escape probability
due to the presence in the line profile of broad
Stark wings. The atomic transition rate equations include
all the collisional and radiative terms for energy
levels up to n = 30, with angular momentum l sublevels
treated explicitly for n < 4. It has been shown [6] that
such an approach does not cause appreciable errors for
the density range considered. The slab model has uniform
thickness, electron temperature and density, all of
which are input parameters.

Figure 1: Schematic representation of the disc model
(the figure marks the line of sight).

A set of slab models of given thickness and temperature
but monotonically decreasing density may be
combined to create a simple model of a cylindrical circumstellar
disc of uniform temperature [2]. The radial
density structure is a step function of the form

$$N_{e,i} = N_{e,0} \, (r_i / r_0)^{-n}$$

where N_{e,i} is the electron density
of the ith ring and r_i its inner radius. The overall disc
spectrum is simply a sum of the individual slab models,
appropriately weighted by their relative geometric
areas. Thus the relevant parameters which define a
constant thickness disc model are the inner disc radius,
electron temperature, density index (n) and emission
measure (N_{e,0}^2 Z, where Z is the disc thickness). Of
these, only the latter two affect the shape of the IR
continuum [2].
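The ring construction can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' FORTRAN: the slab spectra are taken as given lists of fluxes (in the paper they come from the Drake model), the weights are assumed to be annulus areas normalised by the total disc area, and all function names are hypothetical.

```python
import math


def ring_densities(n_e0, inner_radii, r0, index):
    """Power-law step densities: n_e0 * (r_i / r0) ** (-index) per ring."""
    return [n_e0 * (r / r0) ** (-index) for r in inner_radii]


def ring_areas(edges):
    """Face-on area of each annulus between consecutive radii in edges."""
    return [math.pi * (edges[i + 1] ** 2 - edges[i] ** 2)
            for i in range(len(edges) - 1)]


def disc_spectrum(slab_spectra, areas):
    """Combine slab spectra weighted by relative area (assumed weighting)."""
    total_area = sum(areas)
    n_points = len(slab_spectra[0])
    return [sum(a / total_area * spec[k] for a, spec in zip(areas, slab_spectra))
            for k in range(n_points)]


if __name__ == "__main__":
    densities = ring_densities(1e12, [1.0, 2.0, 4.0], r0=1.0, index=2)
    areas = ring_areas([1.0, 2.0, 3.0])
    # Two hypothetical two-point slab spectra, one per ring
    print(densities, disc_spectrum([[1.0, 1.0], [3.0, 3.0]], areas))
```

Each ring's density would be passed to one slab calculation, which is what makes the problem divide naturally into independent pieces of work.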

4  A Parallel Implementation of the Model

The parallel implementation of the model allows us to
explore and implicitly combine the Drake models with
only minor modifications to the code by using a master
processor to control the overall simulation. The problems
of parallel decomposition of simulations have been
extensively documented [7,8]. The two approaches described
here are an event decomposition and an algorithmic
decomposition, which we have tried to formalize
with the mathematics of orderings [9].

The changes are to some of the I/O code and
provision of a harness and master process to support
these changes. In this way any modifications to the
Drake model do not require major changes in the transputer
implementation. The input of relevant varying
disc parameters is by user input channels and output
of the required data by user output channels. Any
non-essential screen and file output is commented out. All
other file input is left unchanged, although keyboard input
must be redirected to be either from a file or from
a user channel. Diagnostic screen output is sent by the
harness to the supervisor bus, a global low bandwidth
communication bus, where it is echoed to the screen by
the local host.

In this way the minimum amount of modification
is required to the original FORTRAN code. To implement
an event paradigm the harness forwards the filing
system using Meiko supplied routines and forwards
data using a modified OCCAM load balanced pipeline
[10]. Forwarding the filing system has the disadvantage
that each transputer will request a copy of each file,
increasing communication across the low bandwidth interface
with the μVAX II, and there is also a request
latency time which increases as the number of transputers
in the pipe increases. However, the volume of
data is typically low and is only loaded initially to define
the model. The alternative of replacing all file input
with user channel input would result in a decrease in
this initial time but a large increase in the amount of
modification to the FORTRAN code and to the master
process.


Figure 2: Decomposition of code into a pipeline
(figure labels: Harness; User written Fortran).

Figure 2 shows the configuration of the system
for an event decomposition of the code. The local host
acts as a local file server, handling all communication
with the host machine. The master processor generates
the relevant data to send to each worker. Each worker
processor executes its own copy of the Drake model with
the parameters sent to it from the master and sends the
resulting spectrum back to the master. The various
slab spectra are combined by the master to produce
the composite spectrum, which may be displayed on a
graphics monitor as well as written to disk on the host
machine.

This form of decomposition suffers from a number
of problems. The first is that it requires a large
amount of memory for each transputer, and secondly,
if we are considering a single model run then there
is a minimum turnaround time defined by the length
of time to compute one Drake model. This is a common
problem with event parallelism in that it increases
throughput but does not reduce turnaround time. A
further problem is that there is an overhead in emptying
the pipe, when work will not be load balanced.
This is due to work being buffered further down the
pipe where it is not accessible to workers earlier in the
pipe. The effect of this can be reduced by limiting the
degree of data buffering using buffer handshaking or
alternatively by implementing a work request mechanism.
Such a mechanism increases communication overheads
and introduces a work request latency time, even if a
more appropriate topology such as a tree is used [11].

We now consider a simple algorithmic decomposition
based on a data flow network, to reduce the
turnaround time and memory required by each transputer.
This can be incorporated in a load balanced
pipeline to further increase throughput.

The decomposition of code into parallel communicating
processes relies on identifying the data flow
within the code and independent sections of computation.
To identify the interdependence of a sequence of
sequential processes we can define a partial order ($\prec$)
on the code in addition to the natural total ordering
(<) [12]. Consider sections of code P, Q, R; then

$$P < Q, \quad \mathrm{var}(P) \cap \mathrm{acc}(Q) \neq \{\} \;\Rightarrow\; P \prec Q$$

and

$$P \prec Q, \quad Q \prec R \;\Rightarrow\; P \prec R$$

where var(P) is the set of variables that P can
assign to and acc(Q) is the set of variables accessed
by Q that it does not initialise (cf. Bernstein's Conditions
[13]). Having ordered sections of our code we
have identified the interdependence, since $P \nprec Q$ and
$Q \nprec P$ implies P and Q can be performed concurrently.
It also indicates how processes can be pipelined.
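The ordering can be illustrated with a short sketch. It assumes each code section is described by a name, the set of variables it assigns, and the set it reads without initialising; it applies only the flow condition given in the text (not the full set of Bernstein's conditions), and the four sections P, Q, R, S and their variables are hypothetical.

```python
from itertools import combinations


def precedence(sections):
    """Build the dependence order from (name, assigned, accessed) triples.

    sections must be listed in program order. An earlier section precedes
    a later one if it assigns a variable the later one reads.
    """
    order = set()
    for i, (p, assigned_p, _) in enumerate(sections):
        for q, _, accessed_q in sections[i + 1:]:
            if assigned_p & accessed_q:
                order.add((p, q))
    names = [name for name, _, _ in sections]
    # Transitive closure (Warshall): the intermediate k is the outer loop
    for k in names:
        for i in names:
            for j in names:
                if (i, k) in order and (k, j) in order:
                    order.add((i, j))
    return order


def concurrent_pairs(sections, order):
    """Pairs unrelated in either direction, which may run concurrently."""
    names = [name for name, _, _ in sections]
    return [(a, b) for a, b in combinations(names, 2)
            if (a, b) not in order and (b, a) not in order]


if __name__ == "__main__":
    code = [("P", {"a"}, set()), ("Q", {"b"}, {"a"}),
            ("R", {"c"}, {"a"}), ("S", {"d"}, {"b", "c"})]
    rel = precedence(code)
    print(sorted(rel), concurrent_pairs(code, rel))
```

In this example Q and R both depend only on P, so they are unrelated by the order and could be placed on separate processors.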

Clearly this ordering is only applicable to sequences
of CSP processes of the form P; Q. To extend
it to code embedded in control structures we can either
consider only sections of code for which this is true or
look at the traces (i.e. the sequence of possible events)
of the code. Sequences of events can be identified that
are independent with the above partial order and this
can be related to the FORTRAN code. The Hasse diagram
of the partial order on the set of processes indicates
the topology and interconnection of the processes,
where each line segment symbolizes a communication
link over which required shared data must be passed
(cf. Precedence Graphs [14]). It also indicates any
synchronization points. The direction of communication is
indicated by the partial order. Such a decomposition
is deadlock free due to the absence of any circular wait
condition [4].
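A Hasse diagram keeps only the covering pairs of a partial order, that is, pairs with no element strictly between them. The sketch below computes those edges from a transitively closed set of pairs; the function name and the example order are illustrative and are not taken from the Drake model.

```python
def hasse_edges(order):
    """Return the covering pairs of a strict, transitively closed order.

    order is a set of (a, b) pairs meaning "a precedes b". A pair is kept
    only if no third element c satisfies a precedes c and c precedes b.
    """
    elements = {x for pair in order for x in pair}
    return sorted(
        (a, b) for (a, b) in order
        if not any((a, c) in order and (c, b) in order for c in elements)
    )


if __name__ == "__main__":
    closed = {("P", "Q"), ("P", "R"), ("Q", "S"), ("R", "S"), ("P", "S")}
    # Each remaining edge would correspond to one communication link
    print(hasse_edges(closed))
```

Here the pair (P, S) is dropped because Q (or R) lies between P and S, so no direct link from P to S is needed in the process network.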

The remaining problem is to map this diagram
onto the topology of the machine. This will typically
involve the use of either a multipurpose harness [15]
or CS tools [10]. Alternatively the processes can be
mapped directly. The problems of mapping processes
onto processors have been documented in [17].

Figure 3: Hasse diagram for the Drake model
(figure legend: decomposition onto transputers;
looped sections of code further refined).

Figure 3 shows the decomposition of the 3000
line Drake model code. The processes are mapped so as
to minimise execution time and load balance computation.
We have considered coarse grained decomposition
of the code such that the increased communication time
is less than the overlapped computation time. Further
decomposition of each section of code can be performed
to give a finer level of parallelisation and there will exist
an order-preserving mapping between each level of
refinement.

The decomposition of the code has relied heavily
on the structure of the code, with most of the distributed
processes being FORTRAN subroutines. In
this way the process of calling and returning from a
subroutine, copying the values of shared variables and
results, is closely matched by the effect of a communication
to a CSP process [4]. The algorithmic decomposition
requires less than 200K per processor, compared
to 350K required for the event decomposition.

5  Results 

A preliminary fit to the observed optical and near-IR
continuum of a Be star (66 Ophiuchi) is shown in Figure 4.
The data were obtained contemporaneously in March 1988
to minimize confusion arising due to the intrinsic variability
of Be stars. A Kurucz model stellar atmosphere
[18], computed by the ATLAS6 program on a CRAY,
which best corresponds to the spectral type of the central
star is combined with a disc model computed as
described in Section 3, to give the dashed line of total
emission. A least squares fit has been achieved
by normalizing the Kurucz model to the bluest optical
continuum point (i.e. assuming that the stellar atmosphere
dominates the spectrum at this wavelength)
and then varying the disc model parameters. The fit
demonstrates clearly that the combination of a stellar
atmosphere with free-free emission from an ionized disc
can well describe the observed continuum (as well as the
Balmer emission; see [2]). However, given a comprehensive
sample of Be stars, the model as presently encoded
is less satisfactory in some cases (in particular, in the
case of the X-ray binaries). We feel these cases may
be attributed to the present state of sophistication of
the model and a lack of coverage of parameter space.
Both problems we expect to surmount with the use of
transputers.

Figure 4: Preliminary fit to observed continuum of 66
Ophiuchi.

The time to calculate 50 constant thickness models
consisting of 6 rings for 200 wavelength points is
shown in Figure 5. The graph is of VAX CPU time
against elapsed time on the Computing Surface; a
speed-up of about 50 times over the μVAX II is observed
for the event decomposition. The transputer array has
the advantage that further transputers can be used,
with only minor modifications to the harness code, to
provide greater speed-ups for larger models. This will be
required for future enhancements to the model and to
explore the parameter space adequately.

Figure 5: Timing comparison for 50 ring constant thickness
model.

The efficiency [7] of the implementation is 61%
for the algorithmic decomposition and > 98% for the
event decomposition. This does not include the master
processor, only the number of workers. For a low
number of models the efficiency of the event decomposition
can drop to below 00% due to work not being load
balanced while the pipe is being emptied. The effect
of this on our implementation has been reduced since
data is not over buffered.
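The loss of efficiency while the pipe empties can be illustrated with an idealised model. It assumes equal-cost work items, no communication cost, and workers that each take one item at a time, none of which is claimed for the actual harness; the worker count of 16 is hypothetical.

```python
import math


def farm_efficiency(n_items, n_workers):
    """Efficiency of an idealised farm of equal-cost work items.

    Elapsed time is ceil(n_items / n_workers) rounds, so workers sit idle
    in the final round whenever n_items is not a multiple of n_workers.
    """
    rounds = math.ceil(n_items / n_workers)
    return n_items / (n_workers * rounds)


if __name__ == "__main__":
    for items in (16, 20, 48, 50, 300):
        print(items, round(farm_efficiency(items, 16), 3))
```

In this model the idle final round costs proportionally less as the number of work items grows, which matches the observation that efficiency is high for many models and drops for a low number of models.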

6  Conclusions 

A substantial increase in the performance of running
mathematical models can be achieved by using concurrent
processors. The values given suggest that reductions
of 1-2 orders of magnitude in run times are easily accessible
when comparing the performance of a μVAX II and an
array of T800 transputers.

We note that event parallelisation is an easy
form of parallelisation to implement but has disadvantages
in the amount of memory required by each processor.
Algorithmic decomposition requires more effort to
implement and it can be difficult to obtain high efficiencies,
although it requires less memory per processor and
can reduce turnaround time. The same enhancements
may be enjoyed by many other numerically intensive
computational problems.
Abstract

This  large  scale  application  combines  several  areas  of  re¬ 
search  to  develop  computational  models  for  simulating  the 
failure  mechanisms  of  composite  materials  consisting  of 
brittle  fibers  (such  as  carbon)  embedded  in  a  matrix  ma¬ 
terial  (such  as  epoxy  resin).  The  simulations  combine  the 
ideas  of  structural  stress  analysis,  numerical  linear  alge¬ 
bra,  and  visualization  techniques  to  model  the  behavior  of 
fibrous composites under uniaxial tensile load. This will
allow laboratory experiments to be extrapolated more accurately
to real applications, providing an enhanced capability
to optimize designs of large structures made of
composite  materials  with  less  extensive  and  costly  exper¬ 
imental  programs.  Further,  system  performance  and  re¬ 
liability  may  be  improved  substantially.  In  this  paper  a 
brief  discussion  of  the  theory  of  composite  materials  as  it 
relates  to  the  simulations  will  first  be  given.  Next  the  pro¬ 
cedures  used  to  generate  and  analyze  the  structure  will  be 
presented.  The  computational  techniques  used  to  perform 
the  simulation  will  be  given  as  well  as  results  from  selected 
test  cases.  A  summary  of  results  and  future  directions  in 
this  research  will  be  given  at  the  end  of  the  paper. 

Introduction 

Composite materials consisting of high-strength,
high-stiffness fibers embedded in a matrix are lighter
than traditional materials, such as metals or wood, and
are of considerable interest in current engineering practice.
However, composites are not being used as much
as they should be. “The basic reason surely is the uncertainty
that exists in determining their strength and
safe-operating lifetime in service conditions - particularly
when defects could be present” [9]. It is the
fibers, typically of carbon, boron, or Kevlar, that give
the material its uniaxial tensile strength parallel to
the fiber direction. An understanding of the failure
processes of such materials has been pursued for a
number of years by many researchers, including Rosen
[11,12], Harlow and Phoenix [7,8], Wagner, Phoenix,
and Schwartz [16], Smith [13,14], Durham, Lynch, and
Padgett [3,4], and others. Still, a fully satisfactory theory
of composite failure is yet to be achieved. It is
clear that a general theory providing a bridge between
standard laboratory test procedures and actual applications
is highly desirable.

The  structure  of  a  single  unidirectional  lamina  of 
brittle  fibers  in  a  matrix  that  will  be  utilized  is  based 
mainly  on  Rosen’s  [11]  work.  The  unidirectional  lam¬ 
ina  to  be  treated  consists  of  parallel  fibers  in  an  other¬ 
wise  homogeneous  matrix  material.  There  is  a  bond¬ 
ing  of  the  fiber  surfaces  with  the  matrix  material  which 
tends to transfer load to other fibers. Rosen [11] described
the “ineffective length” $\delta$ as the length of segment
around a fiber break required to redistribute the
load borne by the broken fiber. The composite lamina
can be considered as a chain of such segments, each of
length $\delta$, referred to as the “chain of bundles” model.
This  important  feature  provides  a  natural  discretiza¬ 
tion  of  the  composite  in  the  fiber  direction  which  can 
be  exploited  computationally. 

Modeling  Procedures 

In  this  application,  a  pinned-jointed  structure  de¬ 
picting  a  unidirectional  composite  lamina,  shown  in 
Figure  A,  is  utilized  as  a  stress  model. 


Figure  A. 


The fiber centers are represented in the model as vertical
line segments, each of length $\delta$, joined end to
end, while the lines of load transfer through the body
of the fiber and the matrix material are represented
as diagonal line segments connected to the fiber segments
at the joints. Of course, the actual fiber takes
up  most  of  the  space  between  vertical  line  segments 
which  are  located  at  the  centers  of  the  fibers.  Thus 
the  direct  forces  within  the  fibers  are  experienced  along 
the  vertical  segments  and  shear  forces  are  transferred 
diagonally  across  the  fibers,  through  the  matrix  and 
onto  the  adjacent  fibers.  The  resulting  pinned-jointed 
structure  appears  as  a  triangular  mesh.  A  tensile  load 
is  applied  in  the  fiber  direction  and  the  stresses  in  the 
members  computed. 

The  analysis  of  the  structure  follows  the  math¬ 
ematical  methods  set  forth  by  Strang  [15]  for  solving 
stress  equations  derived  from  jointed  truss  structures. 
Basically, an incidence (or in this case elongation) matrix
$A$ is formed from the geometrical nature of the
structure. The matrix $A$ relates displacements at free
nodes to elongations in the attached members, while
$A^T$ relates internal forces in the members to external
forces at the free nodes. In addition, a materials
matrix $C$ is formed from the elastic constants for
each member. The matrix $C$ relates elongations in
the members to internal forces in the members, by the
Young's moduli. Once these two matrices are computed,
the stiffness matrix $K$ for the structure is computed
by

$$K = A^T C A \qquad (1)$$

and, given the displacements $x$ at the free nodes, the
force balance (equilibrium equation) $f$ at the nodes
is computed by $A^T C A x = f$. Figure B outlines this
procedure.

Figure  B  —  The  Stiffness  Equation 

Given
    $x$ : the displacements at the free nodes

Compute
    $Ax$ : the elongations of the members
    $C(Ax)$ : the internal forces in the members
    $A^T(CAx)$ : the external force balance at the nodes
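
As a minimal sketch of the procedure in Figure B, consider a hypothetical chain of two members hanging from a fixed support, with one degree of freedom per free node. The matrices, the stiffness values, and the helper names below are illustrative assumptions and are not taken from the code used in this work.

```python
# Hypothetical example: two members in series hanging from a fixed support.
# Member 0 joins the support to node 0; member 1 joins node 0 to node 1.

def matvec(M, v):
    """Return the product M v for a dense matrix stored as a list of rows."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def transpose(M):
    """Return the transpose of a dense matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

# Elongation matrix A: one row per member, one column per degree of freedom.
A = [[1.0, 0.0],    # elongation of member 0 = x0
     [-1.0, 1.0]]   # elongation of member 1 = x1 - x0

# The materials matrix C is diagonal, so only its diagonal is stored
# (assumed stiffness values).
c = [3.0, 2.0]

def force_balance(x):
    """Follow Figure B: x -> A x -> C (A x) -> A^T (C A x)."""
    e = matvec(A, x)                            # elongations of the members
    s = [c_i * e_i for c_i, e_i in zip(c, e)]   # internal forces in the members
    f = matvec(transpose(A), s)                 # force balance at the nodes
    return e, s, f

e, s, f = force_balance([1.0, 3.0])
print(e, s, f)
```

For this small chain the product $A^T C A$ can be checked by hand: it is the 2 by 2 matrix with rows (5, -2) and (-2, 2), and applying it to the displacement vector (1, 3) gives the same force balance as the three-step evaluation.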

The problem we wish to solve is somewhat more
computationally complex: given a force balance $f$ at the
free nodes, we wish to find the displacements $x$ at the
free nodes as well as the internal stresses $s$ in the members,
which amounts to solving the two matrix
equations

$$(A^T C A)^{-1} f = x \quad \text{and} \quad C A x = s. \qquad (2)$$

Note that the stiffness matrix $K = A^T C A$ is positive
definite and therefore invertible. For the structure
in question, each column of $A$ represents a particular
degree of freedom for the structure, while the rows
with nonzero entries in that column correspond to the
members attached to that node. With a regular ordering
of nodes and members it is possible to compute, in
parallel, the individual columns of $A$. Note that the
maximum number of nonzero entries in any column is
6, substantially reducing the memory requirements for
the column data. Likewise, with a regular ordering of
members, the individual elements of $C$ can be easily
computed. The storage structure we chose to use for
each column is given in Figure C.

Figure  C  —  Column  Storage  Structure 
column_vector:
    rows  = number of nonzero values in column
    index = integer values for nonzero member numbers
    value = values of nonzero entries

Thus,  each  column  will  occupy  no  more  than  52  bytes. 
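
The storage structure of Figure C can be sketched as follows. The 52-byte figure is consistent with one 4-byte count, six 4-byte indices, and six 4-byte floating-point values; that layout is an assumption on our part, as are the class and function names, which are hypothetical.

```python
import struct
from dataclasses import dataclass
from typing import List

MAX_NONZEROS = 6  # maximum number of nonzero entries in any column of A

@dataclass
class ColumnVector:
    rows: int            # number of nonzero values in the column
    index: List[int]     # member (row) numbers of the nonzero entries
    value: List[float]   # values of the nonzero entries

# Assumed layout: one 4-byte count, six 4-byte indices, six 4-byte floats.
RECORD_BYTES = struct.calcsize("=i6i6f")

def add_column_contribution(col, x_j, y):
    """Accumulate x_j times this column of A into the elongation vector y."""
    for k in range(col.rows):
        y[col.index[k]] += col.value[k] * x_j

def dot_with_column(col, s):
    """Return one entry of A^T s: the dot product of this column with s."""
    return sum(col.value[k] * s[col.index[k]] for k in range(col.rows))

# A column with two nonzero entries, touching members 0 and 1.
col = ColumnVector(rows=2, index=[0, 1], value=[1.0, -1.0])
y = [0.0, 0.0, 0.0]
add_column_contribution(col, 2.0, y)
print(RECORD_BYTES, y, dot_with_column(col, [5.0, 3.0, 7.0]))
```

Storing $A$ by columns in this way serves both products needed later: $Ax$ is accumulated column by column, and each entry of $A^T s$ is a short dot product over at most six members.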

The solution of the equation $(A^T C A)^{-1} f = x$ is
the most computationally intensive portion of equation 2
and requires the most efficient parallel implementation.
Since $A$ and $C$ are sparse, it would be
advantageous to use solution techniques which preserve
their matrix structure. As the direct factorization
of the stiffness matrix would necessitate computing
$K = A^T C A$ and then factoring $K$ by some method,
thereby losing the previously described sparsity structure,
we chose to use an iterative technique and work
directly with $A$ and $C$. Iterative techniques in general
have been shown to be very efficient when applied
to sparse (as well as full) matrices and implemented
on hypercube multiprocessors by Fox et al. [5] and
Baldwin [1]. Several iterative techniques were tried in
order to solve equation 2. The first method we attempted
to use was the Gauss-Jacobi technique, which
is easily adapted to the hypercube and shows good
convergence properties for a large collection of positive
definite matrices. Unfortunately, the method could not
be shown to always converge for the stress matrix under
its assumptions. With the Gauss-Seidel technique, given
the generated matrices $A$ and $C$, we
could not efficiently implement the back substitution
phase of the algorithm.

Next the conjugate gradient algorithm was applied
to the problem of solving equation 2. The conjugate
gradient technique is described in Fox et al. [5]
and has the property that it converges for all positive
definite matrices and is easy to implement on sparse
matrices. The given composition of the $A$ and $C$ matrices
made the application of the conjugate gradient
to the problem very efficient. A number of preconditioning
techniques, such as Jacobi (main diagonal) preconditioning
as described in [6], can be applied to the
conjugate gradient algorithm to speed convergence of
the iterates, but none have as yet been used in this application.
The basic conjugate gradient method used
to solve $Ax = b$ is given in Figure D.


Figure D — The Conjugate Gradient Algorithm

Note that there is only one matrix-vector product in
each iteration of the conjugate gradient algorithm, which
makes it advantageous for use with sparse matrices.

Once the stresses in the members are computed,
they are compared with the strengths of the corresponding
fiber components, obtained from the brittle fracture
distribution [4] or the well-known Weibull distribution,
to check for possible breakage.
The strengths of the fibers are computed using uniform
random numbers generated by the parallel linear
congruential method as described in Fox et al.
[5]. The basic sequential linear congruential algorithm
used is given in Figure E.


Figure  E  —  The  Linear  Congruential  Method 

m = a modulus, m > 0
a = a multiplier, 0 < a < m
c = an increment, 0 < c < m
X_0 = an initial value, 0 < X_0 < m
while n < number_of_randoms
    X_{n+1} = (a X_n + c) mod m


The choice of the values m, a, and c will greatly affect
the randomness of the above algorithm; see Knuth [9].
It may also be necessary, depending upon the desired
randomness of the numbers, to add a shuffling algorithm
such as the one described by Bays and Durham
[2].
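
The sequential recurrence of Figure E, together with one way of turning uniform deviates into fiber strengths, can be sketched as follows. The constants $a$, $c$, and $m$ are a commonly quoted 32-bit choice and are not the values used in this work; the Weibull scale and shape parameters are likewise placeholders, and the inversion formula is the standard one for the two-parameter Weibull distribution.

```python
import math

def lcg(seed, count, a=1664525, c=1013904223, m=2**32):
    """Yield `count` uniform deviates in [0, 1) from X_{n+1} = (a X_n + c) mod m."""
    x = seed
    for _ in range(count):
        x = (a * x + c) % m
        yield x / m

def weibull_strength(u, scale, shape):
    """Map a uniform deviate u in [0, 1) to a Weibull variate by inversion."""
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)

# Illustrative parameters only: five fiber strengths from one seed.
uniforms = list(lcg(seed=12345, count=5))
strengths = [weibull_strength(u, scale=1.0, shape=5.0) for u in uniforms]
print(strengths)
```

A parallel generator or a shuffling stage, as mentioned in the text, would wrap the same recurrence; neither is shown here.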

The above methods are the main computational
techniques used in this simulation, and all can be
considered standard. In addition, two critical portions
of  the  simulation  are  balancing  the  computational  load 
between  the  processors  and  visualization  of  the  stresses 
within  the  structure.  For  this  particular  application 
both  of  these  tasks  were  found  to  have  straight-forward 
implementations  on  the  hypercube.  With  regard  to 
load  balancing,  we  wish  to  evenly  distribute  the  work 
involved  in  solving  the  system.  We  first  number  the 
joints  as  well  as  the  members  starting  at  the  upper 
left  hand  corner  of  Figure  A  and  continuing  vertically 
down  then  horizontally  to  the  right  and  use  the  stan¬ 
dard  2  dimensional  coordinate  axis  centered  on  the 
upper  left  joint.  In  addition  each  joint  has  two  de¬ 
grees  of  freedom,  one  in  the  x  direction  and  one  in 
the  y  direction.  This  gives  us  a  regular  ordering  for 
the  starting  configuration  of  the  pinned-joints  and  the 
members.  It  follows  that  there  is  a  mathematical  rela¬ 
tionship  between  the  indices  of  one  joint  and  its  neigh¬ 
bors,  as  well  as  the  index  of  a  member  connecting  two 
joints.  Hence  each  processor  computes,  independently, 
a  portion  of  the  incidence,  as  well  as  a  portion  of  the 
materials  matrix  needed  to  perform  the  matrix-vector 
product of the conjugate gradient algorithm. We only
require, for reasons to be explained shortly, that both
degrees of freedom for one joint be assigned to one processor.
Thus the number of columns from the incidence matrix
held by any two processors differs by at most 2. As the
diagonal  members  represent  lines  of  force  transfer  we 
may  not  need  to  explicitly  include  them  in  the  display 
of  the  specimen,  rather  only  the  vertical  fiber  mem¬ 
bers  are  displayed  after  the  stress  computation.  This 
serves  to  reduce  the  computations  needed  for  display 
of  the  stresses.  Since  we  have  assigned  both  degrees 
of  freedom  for  one  joint  to  a  processor  we  have  also 
uniquely  determined  the  processor  which  should  draw 
the  vertical  member  above  the  joint.  By  implement¬ 
ing  these  techniques  we  have  eliminated  the  need  for 
communication  between  processors  in  both  the  matrix 
generation  and  the  display  portions  of  the  program. 
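
The load-balancing rule described above can be sketched as a block distribution of joints, with both degrees of freedom of a joint kept on the same processor. The function names and the column numbering (columns 2j and 2j+1 for joint j) are assumptions made for illustration.

```python
def joints_for_processor(total_joints, num_procs, rank):
    """Return the half-open range [first, last) of joints owned by `rank`."""
    base, extra = divmod(total_joints, num_procs)
    # The first `extra` processors each take one additional joint.
    first = rank * base + min(rank, extra)
    count = base + (1 if rank < extra else 0)
    return first, first + count

def columns_for_processor(total_joints, num_procs, rank):
    """Each joint carries two degrees of freedom, i.e. two columns of A."""
    first, last = joints_for_processor(total_joints, num_procs, rank)
    return [2 * j + d for j in range(first, last) for d in (0, 1)]

# Ten joints on four processors: column counts differ by at most 2.
counts = [len(columns_for_processor(10, 4, r)) for r in range(4)]
print(counts)
```

Because every processor can evaluate these formulas from its own rank and the problem size alone, no communication is needed to decide who owns which columns.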

As an additional note on reducing communication
overhead, the above ideas can be used to optimize
communications in the matrix-vector multiply operations.
The matrix-vector product $Ax$ from Figure B
results  in  a  vector  with  a  length  of  exactly  the  number 
of  members  in  the  structure.  Within  each  processor 
the  individual  multiplications  result  in  vectors  whose 
non-zero  components  represent  the  contribution  to  to¬ 
tal  member  elongation  obtained  from  the  deflection  of 
joints assigned to that particular processor. The total
elongation in all members is then the sum of the
individual processors' elongations. However, depending
upon the size of the hypercube, as well as the
size  of  the  problem,  not  every  node  will  necessarily 
be  assigned  joints  whose  deflection  directly  affects  all 
members  in  the  structure.  Thus,  every  processor  can 
calculate the range of members directly affected by its
assigned joint deflections. Each processor can then compute,
with no communication, which other processors
act upon members within its range. This implies that
even  though  the  columns  are  mapped  into  a  ring  of 
processors,  it  may  not  be  necessary  to  shift  each  col¬ 
umn  through  all  processors.  In  fact,  as  long  as  more 
than  one  fiber  is  assigned  to  each  processor,  only  one 
shift  in  each  direction  of  the  ring  is  required  to  perform 
the  multiplication. 

The  above  is  a  description  of  the  basic  procedures 
which  are  performed  for  a  force  balance  applied  to  the 
structure.  Once  the  displacements  and  internal  stress 
in  each  member  are  computed,  the  incidence  matrix 
should  be  updated  before  another  force  balance  is  ap¬ 
plied.  Also,  when  a  fiber  breaks,  the  incidence  matrix 
should  be  updated  to  reflect  one  less  member  in  the 
structure.  In  this  fashion  the  mechanics  of  the  struc¬ 
ture  can  be  visualized  at  each  stage  until  complete 
failure  is  reached. 

Computational  Techniques 

The  methods  described  above  were  conceived 
with  the  idea  of  keeping  message  traffic  at  a  mini¬ 
mum.  To  this  end,  collective  hypercube  communica¬ 
tions  where  all  processors  participate,  such  as  combin¬ 
ing  partial  sums  in  an  inner  product  or  broadcasting 
common  data,  are  implemented  using  cube  geodesics 
as described in Fox et al. [5] and Gustafson et al.
[6]. Thus collective routines are $O(\log P)$, where $P$
is the number of processors. If the problem size is
such that one processor has the joints of more than
one fiber, then the communications cost in the matrix-vector
multiply is $O(1)$ per iteration. This appears
optimal  for  this  application.  The  primary  area  of  con¬ 
cern  was  that  of  space  in  the  node  processors,  which  is 
limited  to  512K  —  of  which  approximately  48K  is  used 
for  message  buffers,  8K  for  the  node  operating  system, 
and 4K for a jump table for operating system traps. In
addition, the graphics libraries expand the executable,
which is itself approximately 20K, by approximately 50K.
This leaves a total of about 380K for data. Also,
the graphical display has a maximum size of 1024 by
768 pixels, so that one cannot use all 1024 nodes and
achieve  maximum  efficiency  in  the  matrix-vector  prod¬ 
uct  as  described  above.  Alternatives  will  be  discussed 
in  the  closing  remarks. 
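
The $O(\log P)$ behavior of a collective sum can be illustrated with a small simulation of the usual dimension-exchange pattern on a hypercube. This is a sketch of the general idea only; the routines of [5] and [6] may differ in detail, and the function name is hypothetical.

```python
def hypercube_global_sum(values):
    """Simulate an all-to-all sum on a hypercube with len(values) = 2**d nodes."""
    num_procs = len(values)
    dim = num_procs.bit_length() - 1
    if num_procs < 1 or num_procs != 1 << dim:
        raise ValueError("number of processors must be a power of two")
    totals = list(values)
    for d in range(dim):
        # In step d every node exchanges with its neighbor across dimension d
        # and adds the received partial sum to its own.
        totals = [totals[p] + totals[p ^ (1 << d)] for p in range(num_procs)]
    return totals, dim

totals, steps = hypercube_global_sum([1, 2, 3, 4, 5, 6, 7, 8])
print(totals, steps)
```

With eight simulated processors, three exchange steps suffice for every processor to hold the full sum, which is the pattern used for the inner products in the conjugate gradient iteration.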

The following algorithms outline the code for the
host and node programs in the current modeling environment.


Host Algorithm for Composite Modeling
H1  start host timer
H2  get composite data: size of specimen and
    Young's moduli
H3  load program on nodes
H4  send composite data to nodes via broadcast
H5  send forcing data to nodes
H6  wait for timing information from nodes
H7  stop host timer
H8  display timing data for host and node


Node Algorithm for Composite Modeling
N1   initialize timers
N2   initialize graphic display and load color table
N3   receive composite data: size of specimen and
     Young's moduli
N4   perform connection mapping for 1-D ring
N5   construct elongation and materials matrices, A and
     C, and find maximum and minimum number of
     processors to shift partial results of matrix-vector
     product through
N6   get forcing data f from host
N7   perform conjugate gradient to solve A^T C A x = f
     for displacements x
N8   perform matrix-vector multiply C A x = s to find
     internal stresses s
N9   perform plot of stress data s
N10  stop timers
N11  send timers to host


The next algorithm details step N5 in the node
algorithm above.

Matrix Generation Algorithm

G1  Get processor number and size of cube
G2  Compute total number of joints, number of
    joints per processor, and number of processors
    which get an extra joint
G3  Find the position for this processor within a
    ring
G4  Find maximum and minimum member numbers
    which are affected by joints in this processor
G5  For each joint in this processor compute the
    entries in the incidence matrix A for both degrees
    of freedom, as well as the location for the member
    above this joint and its breaking strength
G6  For each member this processor affects, compute
    the entries in the materials matrix C
G7  Search processors on the left for processors
    whose joints affect the same members as this
    processor, noting how far we have to proceed
G8  Search processors on the right for processors
    whose joints affect the same members as this
    processor, noting how far we have to proceed
G9  Combine results of G7 and G8 above to find
    maximum and minimum number of shifts left and
    right needed to perform matrix-vector product
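
The left and right searches at the end of the generation algorithm can be sketched as an overlap test on member ranges. We assume here that each processor can compute the (minimum, maximum) member range of every other processor from the regular numbering, and that the search stops at the first neighbor with no overlap and does not wrap around the ends of the specimen. The function name and the sample ranges are hypothetical.

```python
def shifts_needed(ranges, rank):
    """Count neighbors on each side whose member ranges overlap this one.

    `ranges[p]` is the (min_member, max_member) pair affected by the joints
    of processor p, with processors ordered by position in the ring.
    """
    lo, hi = ranges[rank]

    def overlaps(p):
        other_lo, other_hi = ranges[p]
        return other_lo <= hi and lo <= other_hi

    left = 0
    while rank - left - 1 >= 0 and overlaps(rank - left - 1):
        left += 1
    right = 0
    while rank + right + 1 < len(ranges) and overlaps(rank + right + 1):
        right += 1
    return left, right

# Four processors whose member ranges overlap only with immediate neighbors.
ranges = [(0, 9), (8, 19), (18, 29), (28, 39)]
print([shifts_needed(ranges, p) for p in range(4)])
```

In this example every processor needs at most one shift in each direction, which corresponds to the case in which more than one fiber is assigned to each processor.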

Once  each  node  completes  the  above  algorithm 
all  static  data  structures  are  set  up  and  all  processors 
know  the  length  of  the  pipe  used  in  the  matrix-vector 
product  in  both  the  left  and  right  directions.  Although 
not specifically cited, algorithm G above uses several
utility routines extensively. These routines are listed
below; because of their intuitive nature, they
are not detailed.

Utility Algorithms for the Generation Algorithm
node_to_cart
    maps a joint number to (x,y) coordinates
node_to_neighbors
    maps a joint number to all neighboring joints
nodes_to_members
    maps two joint numbers to a member number
member_to_stress
    maps a member number to its Young's modulus
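
As an illustration of why these routines need no stored tables, one plausible reading of node_to_cart is sketched below. It assumes joints are numbered from 0, column by column, starting at the upper left joint, and it returns grid indices rather than physical coordinates. The parameter joints_per_column and the inverse routine cart_to_node are hypothetical and do not appear in the original code.

```python
def node_to_cart(joint, joints_per_column):
    """Map a joint number to (x, y) grid indices, numbering column by column."""
    x, y = divmod(joint, joints_per_column)
    return x, y

def cart_to_node(x, y, joints_per_column):
    """Inverse of node_to_cart: map grid indices back to a joint number."""
    return x * joints_per_column + y

# Eight joints arranged in two columns of four.
print([node_to_cart(j, 4) for j in range(8)])
```

The neighbor and member mappings follow the same pattern: each is a closed-form function of the indices, so every processor can evaluate it locally.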

The  general  conjugate  gradient  algorithm  was 
presented  earlier,  but  we  have  modified  it  so  that  sub¬ 
routines  can  be  called  to  perform  the  matrix  vector 
products  both  within  the  conjugate  gradient  algorithm 
and in the main node algorithm (step N8). Note the
reuse of $f$ in order to save space.


Conjugate Gradient Algorithm

r = p = f
f = 0
rho = <r, r>
iterations = 0
not_converged = convergence_test(rho)
while not_converged
    y = A p
    y = C y
    q = A^T y
    alpha = rho / <p, q>
    f = f + alpha p
    r = r - alpha q
    rhoprev = rho
    rho = <r, r>
    beta = rho / rhoprev
    p = r + beta p
    iterations = iterations + 1
    not_converged = convergence_test(rho)
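
A serial sketch of this modified algorithm is given below. It applies $A$, $C$, and $A^T$ in turn, so that $K = A^T C A$ is never formed. For clarity it keeps the solution in a separate vector rather than reusing $f$. The dense storage of $A$, the tolerance, the iteration cap, and the small test problem are assumptions made for illustration.

```python
def dot(u, v):
    """Inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))

def apply_K(A, c, p):
    """Return A^T C A p without forming K; A is a list of rows, c the diagonal of C."""
    y = [dot(row, p) for row in A]              # y = A p
    y = [c_i * y_i for c_i, y_i in zip(c, y)]   # y = C y
    return [sum(A[i][j] * y[i] for i in range(len(A)))
            for j in range(len(p))]             # q = A^T y

def conjugate_gradient(A, c, f, tol=1e-12, max_iter=1000):
    """Solve (A^T C A) x = f; returns (x, iterations)."""
    r = list(f)
    p = list(f)
    x = [0.0] * len(f)
    rho = dot(r, r)
    iterations = 0
    while rho > tol and iterations < max_iter:
        q = apply_K(A, c, p)
        alpha = rho / dot(p, q)
        x = [x_i + alpha * p_i for x_i, p_i in zip(x, p)]
        r = [r_i - alpha * q_i for r_i, q_i in zip(r, q)]
        rho_prev = rho
        rho = dot(r, r)
        beta = rho / rho_prev
        p = [r_i + beta * p_i for r_i, p_i in zip(r, p)]
        iterations += 1
    return x, iterations

# Hypothetical two-member chain: recover displacements, then stresses s = C A x.
A = [[1.0, 0.0], [-1.0, 1.0]]
c = [3.0, 2.0]
x, its = conjugate_gradient(A, c, [-1.0, 4.0])
s = [c_i * e_i for c_i, e_i in zip(c, [dot(row, x) for row in A])]
print(x, its, s)
```

In the parallel code the two inner products are the only collective operations in each iteration, and the products with $A$ and $A^T$ are the steps that use the ring shifts described earlier.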

As of this writing, we are experimenting with
visualization of the stress data using several different
ideas. Currently we are mapping the member stresses
to a 4-bit color quantity while using 4 bits of intensity
to contrast the ratio of stress to strength. This
generates two visual effects from the data.

Results 

The  following  tables  give  the  runtimes  along  with 
the  speedup  and  efficiency  of  some  selected  specimen 
sizes.  One  should  note  that  as  processors  are  added 
the  runtimes  drop  until  a  point  is  reached  where  a 
processor  needs  to  communicate  with  more  than  one 
processor  on  the  right  and  left  in  the  matrix  vector 
product.  This  phenomenon  occurs  at  low  degrees  of 
freedom,  and  as  more  degrees  of  freedom  are  added 
the  behavior  is  approximately  linear  with  respect  to 
time until memory space is exhausted. In the tables,
the runtimes are from host timing data, which include
the time to load the program and send data, together with
the associated waiting times. The table for the single-node
case has values extrapolated from the known data, denoted
with a *. These extrapolated values are then used
in subsequent calculations for speedup and efficiency
values. The extrapolating function is one of the form

$$t = a_0 + a_1 x + a_2 x^2 + a_3 e^{-x}.$$

This function contains contributions from the matrix-vector
multiply (the polynomial term) as well as a
damping factor (the exponential term) to take communications
into account. To obtain more of a response
from the data, logarithms were first applied and the
results were then converted back for listing in the tables.
As a note, dof refers to the degrees of freedom of
the structure, and doc refers to the dimension of the
hypercube.
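
The tabulated speedup and efficiency values are consistent with the usual definitions, speedup as the ratio of single-node to multi-node runtime and efficiency as speedup divided by the number of nodes, expressed in percent. The sketch below checks two entries against the tables; the function names are hypothetical.

```python
def speedup(t_single, t_parallel):
    """Ratio of the single-node runtime to the multi-node runtime."""
    return t_single / t_parallel

def efficiency_percent(t_single, t_parallel, num_procs):
    """Speedup divided by the number of nodes, as a percentage."""
    return 100.0 * speedup(t_single, t_parallel) / num_procs

# Runtimes from the tables: dof = 252 on 1 node (23) and on 2 nodes (18).
print(round(speedup(23, 18), 2), round(efficiency_percent(23, 18, 2)))
```

The same functions reproduce the 4-node entry for dof = 5050 from the extrapolated single-node runtime of 813 and the measured runtime of 219.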


1 node times

    dof    runtime
    120         14
    252         23
    500         50
   1530        215
   2040        342
   5050        813*
  10000       1346*
  25100       2304*
  50200       3209*
  99448       4258*

2 node times

    dof    runtime    speedup    efficiency (%)
    120         13       1.08                54
    252         18       1.28                64
    500         32       1.56                78
   1530        118       1.82                91
   2040        184       1.86                93

4 node times

    dof    runtime    speedup    efficiency (%)
    120         12       1.17             29.17
    252         15       1.53             38.33
    500         22       2.27             56.82
   1530         65       3.31             82.69
   2040         98       3.49             87.24
   5050        219       3.71             92.81

16 node times

    dof    runtime    speedup    efficiency (%)
    120         12       1.17              7.29
    252         13       1.77             11.06
    500         16       3.13             19.53
   1530         27       7.96             49.77
   2040         37       9.24             57.77
   5050         94       8.65             54.06
  10000        248       5.43             33.92
  25100        905       2.55             15.91

32 node times

    dof    runtime    speedup    efficiency (%)
    120         13       1.08              3.37
    252         14       1.64              5.13
    500         15       3.33             10.42
   1530         21      10.24             31.99
   2040         26      13.15             41.11
   5050         56      14.52             45.37
  10000        136       9.90             30.93
  25100        472       4.88             15.25
  50200       1401       2.29              7.16

64 node times

    dof    runtime    speedup    efficiency (%)
    252         14       1.64              2.57
    500         16       3.13              4.88
   1530         19      11.32             17.68
   2040         23      14.87             23.23
   5050         38      21.39             33.43
  10000         80      16.83             26.29
  25100        257       8.97             14.01
  50200        742       4.32              6.76
  99448       1990       2.14              3.34

128 node times

    dof    runtime    speedup    efficiency (%)
    500         17       2.94              2.30
   1530         19      11.32              8.84
   2040         20      17.10             13.36
   5050         31      26.23             20.49
  10000         55      24.47             19.12
  25100        143      16.11             12.59
  50200        414       7.75              6.06
  99448       1074       3.96              3.10

8 node times

    dof    runtime    speedup    efficiency (%)
    120         14       0.00              0.00
    252         14       1.64             20.54
    500         17       2.94             36.76
   1530         40       5.38             67.19
   2040         56       6.11             76.34
   5050        169       4.81             60.13
  10000        473       2.85             35.57



256 node times

    dof    runtime    speedup    efficiency (%)
   1530         20      10.75              4.20
   2040         21      16.29              6.36
   5050         28      29.04             11.34
  10000         44      30.59             11.95
  25100         92      25.04              9.78
  50200        255      12.58              4.92
  99448        614       6.93              2.71

512 node times

    dof    runtime    speedup    efficiency (%)
   2040         25      13.68              2.67
   5050         31      26.23              5.12
  10000         45      29.91              5.84
  25100         71      32.45              6.34
  50200        198      16.21              3.17
  99448        421      10.11              1.98

The following plot graphically illustrates the runtimes for
several cube dimensions, namely the 0-dimensional (• or
o), 3-dimensional (*), 6-dimensional (o), and 9-dimensional (*).
The function lg(x) is the logarithm base 2. The measured
values for the 0-dimensional hypercube are denoted with
a •, while the extrapolated values are denoted with a o.
Note that the extrapolated values for the runtimes do
not seem to follow exactly the general pattern of the other
runtime data; however, given the difficulty of extrapolating
data, the predicted values were generated to be
sufficiently conservative, and the actual runtimes are most
likely greater than the extrapolated values.

Runtimes for 4 hypercube dimensions

Summary


In  this  application  we  have  taken  the  ideas  of  struc¬ 
tural  stress  analysis,  numerical  linear  algebra,  and  visual¬ 
ization  to  produce  a  model  for  composite  materials  research 
on  a  hypercube  multiprocessor.  To  date,  a  great  deal  of 
work  has  been  done  on  providing  a  robust,  fast  code  so 
that  future  enhancements  can  easily  be  incorporated.  The 
nature  of  the  problem  lends  itself  well  to  the  hypercube, 
and  does  provide  many  points  of  efficiency,  in  the  decom¬ 
position  stage  for  example.  We  believe  that  the  code  itself 
will  help  in  providing  new  insights  into  the  failure  pro¬ 
cesses  which  occur  in  composites.  The  main  limiting  fac¬ 
tor  found  in  implementing  these  ideas  is  that  of  memory 
size,  as  this  ultimately  limited  the  size  of  problem  we  could 
run. Also, when dealing with this type of simulation, the
importance of being able to graphically visualize the results of the modeling
procedure cannot be overstated. We also feel that with
the  new  generation  of  hypercubes  being  constructed  today 
the  same  ideas  can  be  used  with  greater  success. 

Future  Research 

Our  next  idea  in  this  area  of  research  is  to  provide 
the  ability  to  discretize  the  loading  of  the  structure  in  a 
natural  fashion,  that  is  to  iterate  the  algorithm  outlined  in 
the  node  program.  This  will  allow  us  to  better  visualize 
the  evolution  of  the  structure  up  to  complete  failure.  Also, 
the ability to display the results of computations graphically
on many types of color workstations adds an incentive
to view the hypercube application as a powerful “number-cruncher”
and allow the data to be spooled and later displayed
on a workstation more suitable for animated visualization.
In the program itself, the future modifications
include work on the convergence of the conjugate gradient
algorithm by adding preconditioners, work on random
number generating techniques, and adding the ability
to discretize the loading. This last item is also of great
theoretical  interest  as  the  mechanics  of  the  structure  will 
become very unstable as complete failure is reached.

Abstract


We report on a distributed memory implementation
and initial applications of a program for
calculating electron-molecule collision cross sections.
Runs on the Mark IIIfp hypercube show
that large-grain MIMD machines are well suited
for these applications. Some results of studies
of $e^-$-Si$_2$H$_6$ and $e^-$-SiF$_4$ collisions will be discussed.


I.  Introduction 


We have developed a distributed memory implementation
of a computer code which we have been
using to study the collisions of low-energy electrons
with molecules. Here we report on our strategy
for porting this code to the JPL/Caltech Mark
IIIfp hypercube, our experiences with the parallel
conversion, and some initial results which illustrate
the level of performance achieved. The original
FORTRAN program is based on a multichannel
extension of the variational principle for collisions
originally introduced by Schwinger [1]. This
code, which currently runs in production mode on
CRAY machines, has been used extensively in recent
years to study both elastic and inelastic scattering
of low-energy electrons by molecules such
as H$_2$, N$_2$, CO, H$_2$O, CH$_4$, C$_2$H$_4$, and C$_2$H$_6$.

Our motivations for building a hypercube version
of our code for studying electron-molecule
collisions include, on the one hand, the high cost
of cycles on CRAY-type machines and their inherent
limitations in expected CPU throughput
due to the recursive character of the computationally
intensive step of the calculations, and on the
other hand, the potentially high performance of
large-grain MIMD machines such as the NCUBE,
iPSC, or the Mark IIIfp for this application, whose
structure lends itself naturally to a MIMD architecture.
The high-performance and cost-effective
computing offered by these machines are enhancing
our ability to study cross sections for collisions
of electrons with industrially important gases, e.g.,
C$_2$F$_6$, Si$_2$H$_6$, and CF$_3$H. Such cross sections play
an important role in modelling low-temperature
plasmas used in plasma-assisted etching and deposition
in microelectronic fabrication.

II. Background

The collision of an electron with a molecular target
$A$ may be illustrated schematically as

$$e^-(E_m, \vec{k}_m) + A \rightarrow e^-(E_n, \vec{k}_n) + A^* ,$$

where the electron initially travels with kinetic
energy $E_m$ along the direction specified by the
vector $\vec{k}_m$ and, following the collision, leaves the
molecule along direction $\vec{k}_n$ with energy $E_n$. The
asterisk on $A$ indicates that the molecule may
be rotationally, vibrationally, or electronically excited
by the collision, in which case $E_n < E_m$;
collisions for which $E_n = E_m$ are referred to as
elastic.

The Schwinger multichannel (SMC) procedure
[2,3] is a variational method specifically formulated
for obtaining the probabilities, or cross sections,
for low-energy electron-molecule collision
events, including elastic scattering and vibrational
or electronic excitation. The SMC method is applicable
to molecules of arbitrary geometry, and is
capable of incorporating effects arising from polarization
of the target by the incident electron,
which are particularly important at the lowest energies
(approximately 0-5 eV).

In the SMC procedure, the scattering amplitude
$f(\vec{k}_m, \vec{k}_n)$, whose square modulus is proportional
to the cross section, is obtained in the form

$$f(\vec{k}_m, \vec{k}_n) = -\frac{1}{2\pi} \sum_{i,j} \langle S_m(\vec{k}_m) | V | \chi_i \rangle \, (A^{(+)-1})_{ij} \, \langle \chi_j | V | S_n(\vec{k}_n) \rangle ,$$

where $S_m(\vec{k}_m)$ is an $(N+1)$-electron interaction-free
wave function, $V$ is the interaction potential between the electron
and the molecular target, and the $(N+1)$-electron
functions $\chi_i$ are Slater determinants which form
a basis set for approximating the exact scattering
wave functions. The
$(A^{(+)-1})_{ij}$ are elements of the inverse of the matrix
representation in the basis $\chi_i$ of the operator


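$$A^{(+)} = \frac{1}{2}(PV + VP) - V G_P^{(+)} V + \frac{1}{N+1}\left[\hat{H} - \frac{N+1}{2}\left(\hat{H}P + P\hat{H}\right)\right].$$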

Here $P$ is the projector onto open (energetically
accessible) electronic states,

$$P = \sum_{\ell \in \mathrm{open}} |\Phi_\ell\rangle \langle\Phi_\ell| ,$$

$G_P^{(+)}$ is the $(N+1)$-electron Green's function projected
onto open channels, and $\hat{H} = (E - H)$,
where $E$ is the total energy of the system and $H$
is the full Hamiltonian.


In the present implementation, the Slater
determinants $\chi_i$ are formed from molecular orbitals
which are, in turn, combinations of Cartesian
Gaussian orbitals

$$N_{lmn}\,(x-A_x)^l\,(y-A_y)^m\,(z-A_z)^n\,\exp(-\alpha|\vec{r}-\vec{A}|^2),$$

which are commonly used in molecular electronic-structure
studies. With this choice, all matrix elements
needed in the evaluation of $f(\vec{k}_m,\vec{k}_n)$ can
be obtained analytically, except those involving
the Green's-function term $VG_P^{(+)}V$. These terms
are evaluated numerically via a momentum-space
quadrature procedure [4]. Once all matrix elements
are calculated, the final step in the calculation
is solution of a system of linear equations to
obtain the scattering amplitude $f(\vec{k}_m,\vec{k}_n)$.
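
As a concrete illustration of the basis functions just described, the following Python sketch evaluates a single Cartesian Gaussian at a point. The function name and argument layout are illustrative assumptions; the normalisation constant is passed in rather than computed, and nothing here is taken from the original FORTRAN.

```python
import math

def cartesian_gaussian(r, center, powers, alpha, norm=1.0):
    """Evaluate norm * (x-Ax)^l (y-Ay)^m (z-Az)^n * exp(-alpha*|r-A|^2).

    r, center : 3-tuples of coordinates (point and Gaussian centre A)
    powers    : (l, m, n) non-negative integer exponents
    alpha     : Gaussian exponent
    norm      : normalisation constant, supplied by the caller (assumption)
    """
    dx, dy, dz = (r[i] - center[i] for i in range(3))
    l, m, n = powers
    r2 = dx * dx + dy * dy + dz * dz
    return norm * dx**l * dy**m * dz**n * math.exp(-alpha * r2)

# An s-type function at its own centre, and a p_x-type function one unit away.
s_value = cartesian_gaussian((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0, 0, 0), 1.0)
px_value = cartesian_gaussian((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1, 0, 0), 1.0)
```

Products of such functions with plane waves are what the analytic integral package handles; the sketch covers only the pointwise definition.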

The computationally intensive step in the above
formulation is the evaluation of large numbers of
so-called “primitive” two-electron integrals

$$\langle\alpha\beta|V|\gamma\vec{k}\rangle = \int\!\!\int d^3r_1\,d^3r_2\;\alpha(\vec{r}_1)\,\beta(\vec{r}_1)\,\frac{1}{r_{12}}\,\gamma(\vec{r}_2)\,e^{i\vec{k}\cdot\vec{r}_2}$$

for all combinations of Cartesian Gaussians $\alpha$, $\beta$,
and $\gamma$, and for a wide range of $\vec{k}$ in both magnitude
and direction. These integrals are evaluated
analytically by an intricate “black box”
comprising approximately two thousand lines of
FORTRAN. A typical calculation might require
10^  to  10^^  calls  to  this  integral-evaluation  suite, 
consuming  roughly  80%  of  the  total  computation 
time.  Once  the  primitive  integrals  are  obtained, 
they  are  assembled  in  appropriate  linear  combi¬ 
nations  to  yield  the  matrix  elements  appearing  in 
the expression for $f(\vec{k}_m,\vec{k}_n)$. The original CRAY
code  performs  this  procedure  in  two  steps:  first, 
a  repeated  linear  transformation  to  integrals  in¬ 
volving  molecular  orbitals,  followed  by  a  transfor¬ 
mation  from  the  molecular-orbital  integrals  to  the 
physical  matrix  elements  involving  Slater  deter¬ 
minants.  The  latter  step  is  equivalent  to  an  ex¬ 
tremely  sparse  linear  transformation  whose  coeffi¬ 
cients  are  determined  in  an  elaborate  subroutine 
with  a  complicated  logical  flow. 

III. Concurrent Implementation


The necessity of evaluating large numbers of
“primitive” two-electron integrals makes the SMC
procedure a natural candidate for parallelisation
on a MIMD machine such as the Mark IIIfp
hypercube. The large memory and general-purpose
processors of the Mark IIIfp make it feasible to
distribute the “black box” integral evaluator across
the processors and to divide up the evaluation of
the primitive integrals among all the processors.
In planning the decomposition of the set of
integrals onto the nodes of the hypercube, two
principal issues must be considered. First, the
number of integrals required is such that not all can
be stored in memory simultaneously, and certain
indices must therefore be processed sequentially.
Second, the transformation from primitive
integrals to physical matrix elements, which
necessarily involves interprocessor communication, should
be as efficient and transparent as possible. With
both of these considerations in view, the approach
chosen was to configure the hypercube as a logical
two-torus, to which is mapped an integral
matrix whose columns are labeled by Gaussian pairs
$(\alpha,\beta)$, and whose rows are labeled by momentum
directions $\hat{k}$; the indices $|\vec{k}|$ and $\gamma$ are processed
sequentially.
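
One common way to realise such a logical two-torus on a hypercube is a binary-reflected Gray-code embedding, sketched below in Python. The text does not say which embedding was actually used, so the Gray code, the block distribution, and all function names here are assumptions made for illustration only.

```python
def gray(i):
    """Binary-reflected Gray code of i."""
    return i ^ (i >> 1)

def torus_node(row, col, row_bits, col_bits):
    """Hypercube node number for logical torus coordinate (row, col).

    The hypercube dimension is row_bits + col_bits; neighbouring torus
    coordinates map to node numbers that differ in exactly one bit.
    """
    return (gray(row) << col_bits) | gray(col)

def owner(k_index, pair_index, n_k, n_pairs, row_bits, col_bits):
    """Node owning integral-matrix entry (direction k_index, pair pair_index).

    Rows (momentum directions) and columns (Gaussian pairs) are split into
    contiguous blocks, one block per torus row or column (an assumption).
    """
    rows, cols = 1 << row_bits, 1 << col_bits
    row = k_index * rows // n_k
    col = pair_index * cols // n_pairs
    return torus_node(row, col, row_bits, col_bits)

# Example: a dimension-6 cube viewed as an 8 x 8 torus.
all_nodes = {torus_node(r, c, 3, 3) for r in range(8) for c in range(8)}
```

With this mapping each of the 64 node numbers is used exactly once, and a shift by one position along either torus direction, including the wrap-around, is a single hypercube hop.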

Given this choice of data decomposition, a design
for the parallel transformation procedure must be
chosen. Direct emulation of the sequential code
(that is, transformation first to molecular-orbital
integrals and then to physical matrix elements) is
undesirable, because the latter step would entail
an intricate parallel routine governing the
complicated flow of a relatively limited amount of
data between processors. The potential for coding
errors would be unacceptably high. Instead,
the two transformations are combined into a single
step by using the logical outline of the original
molecular-orbital-to-physical-matrix-element
routine in a distributed version of the CRAY
sequential routine which builds a distributed
transformation matrix. The combined transformations
are then accomplished by a series of large, almost-full
complex-arithmetic matrix multiplications
directly on the primitive-integral data set. The
transformation steps and associated interprocessor
communication are thus localised and “hidden”
in large parallel multiplications, which are
known to be efficient on hypercube architectures
[13]. Besides efficiency, benefits of this approach
include simplicity and enhanced portability of the
resulting code.
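
The reason the two transformation steps can be folded into one is simply the associativity of matrix multiplication. The Python sketch below illustrates this with small made-up matrices: `c_mo` stands in for primitive-to-molecular-orbital coefficients, `c_det` for the sparse second step, and `prim` for a block of complex primitive integrals. The names, the sizes, and the reduction to a single transformed index are all assumptions for illustration; the real transformation acts on several indices.

```python
def matmul(a, b):
    """Dense matrix product for lists of rows (complex entries allowed)."""
    inner = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]

c_mo = [[1, 2, 0],
        [0, 1, -1]]                     # 2 x 3, hypothetical first step
c_det = [[1, 0],
         [0, 0],
         [2, 1]]                        # 3 x 2, hypothetical sparse second step
prim = [[1 + 2j, 0.5j],
        [2, 1 - 1j],
        [0.25, 3]]                      # 3 x 2 block of primitive integrals

# Sequential-style route: transform twice.
two_step = matmul(c_det, matmul(c_mo, prim))

# Combined route: build one transformation matrix, then multiply once.
combined = matmul(c_det, c_mo)
one_step = matmul(combined, prim)
```

Both routes give the same matrix elements. In the combined route the communication pattern is that of an ordinary distributed matrix multiplication, whatever the sparsity structure of the second step.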

The remainder of the parallel implementation
involves relatively straightforward modifications of
the sequential CRAY code, with the exception of
a series of integrations over angles $\hat{k}$ arising in the
evaluation of the $VG_P^{(+)}V$ matrix elements, and
of the solution of a system of linear equations in
the final phase of the calculation. The angular
integration, done by Gauss-Legendre quadrature, is
compactly and efficiently coded as a distributed
matrix  multiplication  of  the  form  Adiag(wj)A^. 
The integration over $|\vec{k}|$ is essentially accomplished
in SIMD fashion. The solution of the linear system
will be performed by a distributed LU solver [14]
modified for complex arithmetic, implementation
of which is under way. This will make feasible
solution of systems on the order of 2000 x 2000
on current hardware. However, in applications
to date, the size of the linear systems (less than
100 x 100) has allowed use of the original sequential
solver running either on the host or on a single
node.
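
To see why a quadrature can be written as a matrix product, note that a sum over quadrature points with weights $w_j$ of a product of two sampled functions is one entry of a matrix of the form A diag(w) B. The Python sketch below uses three-point Gauss-Legendre nodes on [-1, 1], real sample functions, and a plain transpose for the second factor. These choices are assumptions for illustration; they are not the angular grid or the matrices of the production code.

```python
import math

# Three-point Gauss-Legendre nodes and weights on [-1, 1].
nodes = [-math.sqrt(3.0 / 5.0), 0.0, math.sqrt(3.0 / 5.0)]
weights = [5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0]

def quadrature_matrix(a, w):
    """Return M = A diag(w) A^T, i.e. M[i][k] = sum_j a[i][j] * w[j] * a[k][j]."""
    n = len(a)
    return [[sum(a[i][j] * w[j] * a[k][j] for j in range(len(w)))
             for k in range(n)]
            for i in range(n)]

# Row i of A holds function i sampled at the quadrature nodes.
functions = [lambda x: 1.0, lambda x: x, lambda x: x * x]
A = [[f(x) for x in nodes] for f in functions]

# M[i][k] approximates the integral of f_i * f_k over [-1, 1].
M = quadrature_matrix(A, weights)
```

Because the three-point rule is exact for polynomials up to degree five, every entry of `M` here equals the exact integral, for example 2 for the (0, 0) entry and 2/5 for the (2, 2) entry.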

IV. Performance

No  attempt  has  been  made  to  benchmark  the 
parallel  electron  scattering  code  in  detail.  Such 
an exercise is irrelevant here, because the integrals
are  embarrassingly  parallel,  and  matrix  multiplies 


Fig. 1 Time for computation and transformation
of a complete set of two-electron integrals,
for fixed $|\vec{k}|$, as a function of the Mark IIIfp
hypercube dimension (squares). Also shown is an
exponential best fit (solid line), with parameters
as indicated in the figure, and, for comparison, the
single-processor CRAY Y-MP time (dashed line).

and LU decomposition have been previously
assessed. However, the performance of the parallel
code relative to the original CRAY code has been
assessed through a series of calculations on
Mark IIIfp hypercubes of
dimensions from 0 (a single processor) to 6 (64
processors), the largest currently available. The
same calculation was also performed on a CRAY
Y-MP with the original code. For these comparisons,
a modest but realistic “production run” for
the CO molecule using 32 Cartesian Gaussian
orbitals was chosen. Results are presented in Figs. 1
and 2. Figure 1 shows the time required for a single
“quadrature shell” of integrals, i.e., for evaluation
and transformation of a complete set of
two-electron integrals for a fixed magnitude $|\vec{k}|$,
as a function of the cube dimension. All I/O and
code loading are included in timings. The Weitek
XL floating point processor performs the primitive
integral calculation at roughly 0.35 Mflops per
processor. The transformation to physical matrix
elements proceeds at 1.5 Mflops/processor. The
data of Fig. 1 are presented in an alternative fashion
in Fig. 2, which shows speedup as a function

Fig. 2 Speedup as a function of hypercube
dimension for the same case as Fig. 1. Efficiencies
are indicated for each dimension. For comparison,
ideal $2^n$ speedup is shown by the dashed line.

of cube dimension, along with efficiencies (the
ratio of achieved to ideal, or $2^n$, speedup). The
(single-processor) Y-MP time is indicated by the
dashed line. As seen from the figure, the Mark
IIIfp performance surpasses that achieved on the
CRAY in going from 16 to 32 processors. The solid
line, which is an exponential best fit, evidently
describes the observed Mark IIIfp times well over the
range of hypercube dimensions studied, although
the fact that the time decreases as 2'** *^* rather
than $2^{-n}$ indicates that the speedup achieved is
less than ideal. An analogous plot for the total
CO computation time on 8 to 64 processors (not
shown) reveals identical characteristics, reflecting
the dominance of the two-electron integrals in the
calculation. As expected for a problem of fixed
size, the efficiency declines as the hypercube
dimension increases [5], but remains reasonable over
the range studied. Most importantly, on 64
processors, we are outperforming the Y-MP by a factor
of 3 on a small problem. Larger problems will
provide a greater performance differential.
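
The link between the fitted exponent and the efficiency can be made explicit with a short derivation. The symbols $T(n)$ for the time on a cube of dimension $n$, $S(n)$ for speedup, $\varepsilon(n)$ for efficiency, and $a$ for the fitted exponent are notation introduced here for illustration, and the pure exponential form is an idealisation of the best-fit curve.

```latex
\begin{align*}
S(n) &= \frac{T(0)}{T(n)}, \qquad \varepsilon(n) = \frac{S(n)}{2^{n}},\\
T(n) &= T(0)\,2^{-an} \;\Longrightarrow\; S(n) = 2^{an},
\qquad \varepsilon(n) = 2^{-(1-a)n}.
\end{align*}
```

For $a = 1$ the efficiency is unity at every dimension. For any $a < 1$ it falls geometrically with $n$, which matches the qualitative decline described above for a problem of fixed size.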

V. Selected Results

After development and debugging, the concurrent
SMC code was applied to a number of elastic
electron-scattering problems, with an emphasis
on polyatomic gases of interest in low-temperature
plasma applications [6]. Some of the systems
examined to date are ethylene (C2H4), ethane
(C2H6), disilane (Si2H6), and tetrafluorosilane
(SiF4). Illustrative results are presented in Figs.
3-5, along with experimental or other data for
comparison [7-11]. Figure 3 shows the differential
cross section (that is, scattering probability
as a function of the angle $\theta$ between incident and
outgoing directions) for 4 eV electrons colliding
elastically with Si2H6 molecules. Agreement with
recent experimental results [7] is excellent. One
point to observe is the significant probability of
scattering in the high-angle, or near-backward,
directions, for which experimental data are unavailable.
Examination of Fig. 3 suggests that extrapolation
of the measured values to this region is likely
to underestimate the cross section. This fact is
significant because such backscattering makes a large
contribution to the transfer of momentum from
the electrons to the gas molecules and is therefore
important in the numerical modeling of plasmas
and discharges.

Fig. 3 Differential cross section for elastic
scattering of 4 eV electrons by the Si2H6 molecule. The
solid line shows theoretical results obtained on the
Mark IIIfp; the circles are measured values (Ref.
[7]).

Fig. 4 Momentum-transfer or diffusion cross
sections for low-energy electrons colliding with
Si2H6. Shown are the present results (solid line),
estimated values [8] (long dashes), and derived
values [9] (short dashes).


The large backscattering probability indicated
in Fig. 3 contributes to the peak in the Si2H6
momentum-transfer cross section (essentially a
weighted integral over the differential cross
section) shown in Fig. 4 as a function of electron
energy. The dashed curves in Fig. 4, which
represent estimated [8] and indirectly derived [9]
momentum-transfer cross sections, appear to be
the only previously published values for this
industrially important molecule, highlighting the need
for calculations of the present type.
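
The weighting referred to above is conventionally $(1-\cos\theta)$, so that backward scattering counts most. The Python sketch below integrates a user-supplied differential cross section numerically to give both the angle-integrated and the momentum-transfer cross sections. The function name, the midpoint rule, and the two model angular distributions are assumptions for illustration; they are not data or code from this work.

```python
import math

def integrate_cross_sections(dcs, n=2000):
    """Return (integral, momentum_transfer) cross sections.

    dcs : function of the scattering angle theta (radians) giving the
          differential cross section, assumed independent of azimuth.
    Uses the midpoint rule on [0, pi] with n panels.
    """
    h = math.pi / n
    total = 0.0
    momentum_transfer = 0.0
    for i in range(n):
        theta = (i + 0.5) * h
        element = 2.0 * math.pi * dcs(theta) * math.sin(theta) * h
        total += element
        momentum_transfer += element * (1.0 - math.cos(theta))
    return total, momentum_transfer

# Isotropic scattering: both cross sections coincide.
iso_total, iso_mt = integrate_cross_sections(lambda theta: 1.0)

# A backward-peaked model distribution: momentum transfer exceeds the integral.
back_total, back_mt = integrate_cross_sections(lambda theta: 1.0 - math.cos(theta))
```

For the backward-peaked model the momentum-transfer value is larger than the angle-integrated one, which is the mechanism by which strong backscattering raises the momentum-transfer cross section.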

As a further example of the applications
performed to date, Fig. 5 shows preliminary results
for the angle-integrated elastic scattering cross
section of SiF4, along with two measurements
[10,11] of the total scattering cross section, which
should of course be larger than the elastic cross
section. Considering the uncertainties in the
measurements and the need for further refinement of
the theoretical result, the agreement in magnitude
and overall shape of the cross sections is quite
encouraging.


Fig. 5 Elastic electron scattering cross section
for SiF4 obtained on the Mark IIIfp (solid line).
Also shown are total scattering cross section
measurements of Refs. [10] (squares) and [11] (circles).

VI. Conclusions and Future Prospects

The concurrent implementation of a large
sequential code which is in production on CRAY-type
machines is an example of challenges which
are likely to become increasingly frequent as
commercial parallel machines proliferate and as more
and more “mainstream” computer users are
attracted by their potential. Several lessons which
emerge from the port of the SMC code may prove
useful to those contemplating similar projects.
One  is  the  value  of  focusing  on  the  concurrent  im¬ 
plementation  of  the  existing  cod*  [llj  and,  so  far 
as possible, maintaining the structure and code
from the sequential program. The development
of an understanding of the original CRAY code
and its organisation is a demanding part of such
a parallelisation. On the other hand, major issues
of structure and organisation which bear directly
on the parallel conversion deserve very careful
attention. In the SMC case, the principal such issue
was how to implement efficiently the transformation
from primitive integrals to physical matrix
elements. A poor parallelisation of the transformation
could offset the high efficiency of the primitive
integral calculation. The solution arrived at
not only implied that a significant departure from
the sequential code was warranted but also
suggested the data decomposition. One conclusion
is that similar code reorganisation (building and
multiplying large matrices) would improve the
execution on the CRAY. In contrast, the primitive
integral evaluation could not be significantly
improved for the CRAY because it is a recursive
procedure; however, it was easily parallelised for a
large grain machine. A final point worth mentioning
is that the conversion was greatly facilitated
by an environment which fostered collaboration
between workers familiar with the original code
and its application and workers adept at parallel
programming practice, and in which there was
ready access both to smaller machines for debugging
runs and to larger, production machines.

Plans  for  the  near  future  include  the  implemen¬ 
tation  of  the  distributed  LU  solver,  already  men¬ 
tioned,  and  the  implementation  of  portions  of  the 
sequential  code  necessary  for  studies  of  electronic 
excitation  and  for  employing  molecular  symme¬ 
try  to  reduce  computation.  Subsequent  steps  will 
probably  include  optimisation  of  key  sequential 
subroutines and transfer of the code to other parallel
machines as they become available.

Abstract

Oak Ridge National Laboratory has embarked on several
computational Grand Challenges, which require
the  close  cooperation  of  physicists,  mathematicians, 
and  computer  scientists.  One  of  these  projects  is 
the  determination  of  the  material  properties  of  alloys 
from  first  principles  and,  in  particular,  the  electronic 
structure  of  high-temperature  superconductors. 

The  physical  basis  for  high  Tc  superconductivity 
is  not  well  understood.  The  design  of  materials  with 
higher  critical  temperatures  and  the  ability  to  carry 
higher  current  densities  can  be  greatly  facilitated  by 
the  modeling  and  detailed  study  of  the  electronic 
structure  of  existing  superconductors. 

While  the  present  focus  of  the  project  is  on  super¬ 
conductivity,  the  approach  is  general  enough  to  per¬ 
mit  study  of  other  properties  of  metallic  alloys  such 
as  strength  and  magnetic  properties. 

This  paper  describes  the  progress  to  date  on  this 
project.  We  include  a  description  of  a  self-consistent 
KKR-CPA  method,  parallelization  of  the  model,  and 
the  incorporation  of  a  dynamic  load  balancing  scheme 
into  the  algorithm.  We  also  describe  the  develop¬ 
ment  and  performance  of  a  consolidated  KKR-CPA 
code  capable  of  running  on  CRAYs,  workstations,  and 
several  parallel  computers  without  source  code  mod¬ 
ification. 

Performance  of  this  code  on  the  Intel  iPSC/860  is 
also  compared  to  a  CRAY  2,  CRAY  YMP,  and  several 
workstations.  The  code  runs  at  over  1.6  Gflops  on 
a 128 processor iPSC/860. Finally, some density of
states calculations for two perovskite superconductors
are  given. 

*This research was supported by the Applied Mathematical
Sciences Research Program, Office of Energy Research, U.S.
Department of Energy, under contract DE-AC05-84OR21400
with Martin Marietta Energy Systems, Inc.


1  Introduction 

The  discovery  of  high  temperature  superconductiv¬ 
ity  in  1986  has  provided  the  potential  of  spectacu¬ 
larly  energy-efficient  power  transmission  technologies, 
ultra-sensitive  instrumentation,  and  other  devices  us¬ 
ing  phenomena  unique  to  superconductivity.  Each 
year new materials are found to add to the family of
existing  high  temperature  superconductors.  In  gen¬ 
eral  these  materials  are  difficult  to  form  and  use,  and 
some  of  the  superconducting  compounds  are  unsta¬ 
ble.  These  difficulties  are  exacerbated  by  the  lack  of 
an  accepted  theory  explaining  superconductivity  at 
the  higher  temperatures. 

To  further  our  understanding  of  the  behavior  of 
solids  in  general  and  superconductors  in  particular, 
the  quantum  mechanical  laws  have  been  formulated 
into  sophisticated  computer  algorithms  which  can 
predict  from  first  principles  the  structural,  vibra¬ 
tional,  and  electronic  properties  of  matter. 

Present  calculations  of  the  electronic  structure  of 
real  materials  usually  employ  a  mean  field  approxi¬ 
mation  in  which  each  electron  is  viewed  as  moving 
independently  in  a  self-consistent  potential  due  to 
all of the electrons and nuclei. According to density
functional theory, it is possible to express the energy
of any system of electrons and nuclei as a unique
functional of the electron density [1,2,3]. Since this
functional is not known exactly, it is usually
approximated by that appropriate to a homogeneous
electron gas. This local density approximation to
density functional theory has been very successful when
applied  to  metallic  and  semiconducting  systems,  but 
it  appears  inadequate  to  explain  important  physical 
phenomena  such  as  optical  band  gaps  and  supercon¬ 
ductivity  found  in  transition  metal  oxides. 

More  sophisticated  treatments  of  the  many  electron 
problem  are  possible  but  have  not  been  attempted 
previously because the Green’s function and the
susceptibility function that are needed to construct the
electron  self-energy  are  very  difficult  to  calculate  for 
real  systems,  especially  those  with  narrow  bands  such 
as  transition  metal  oxides. 

Our  approach  is  based  on  theoretical  advances 
growing out of work on the Korringa, Kohn, and
Rostoker coherent potential approximation (KKR-CPA)
theory  of  alloys  and  magnetism  [4,5,6].  The  advan¬ 
tage  in  using  the  KKR-CPA  approach  is  that  it  di¬ 
rectly  yields  the  Green’s  function  for  the  system  and 
thereby  a  direct  way  of  calculating  susceptibilities. 

The  effects  of  disorder  are  treated  in  the  CPA, 
which  is  an  analytic  technique  for  calculating  the 
configurationally  averaged  Green’s  function  [7].  The 
KKR  theory  is  the  natural  method  for  implementing 
the  CPA,  because  it  is  a  Green’s  function  method  and 
there  is  a  natural  separation  between  the  lattice  and 
potential. 

Over  the  last  couple  of  years,  researchers  at 
ORNL  and  their  colleagues  have  developed  a  non¬ 
self-consistent  semi-relativistic  KKR-CPA  computer 
code  that  can  handle  multiple  atoms  per  unit  cell. 
The  code  has  wide  applicability  to  situations  in  which 
some  form  of  substitutional  disorder  plays  an  im¬ 
portant  role,  including  metallic  alloys,  high  tem¬ 
perature  superconducting  compounds,  metallic  mag¬ 
netism,  and  metal-insulator  transitions. 

There  are  three  primary  reasons  for  parallelizing 
this  code.  First,  the  KKR-CPA  calculations  are  com¬ 
putationally  intensive.  It  commonly  requires  10  hours 
of  CPU  time  on  a  CRAY  2  to  perform  a  single  KKR- 
CPA  calculation.  It  has  been  estimated  that  over 
1000  hours  of  CRAY  CPU  time  would  be  needed  to 
complete  a  single  self-consistent  computational  exper¬ 
iment.  The  turn-around  time  for  such  experiments 
makes  them  prohibitive  on  existing  serial  computers. 

Second,  the  KKR-CPA  algorithm  has  a  few  points 
of natural parallelism that can be exploited to
increase computational throughput. The point we will
exploit is the calculation of the Density of States
(DOS)  at  a  given  energy  level.  In  order  to  calculate 
the  Fermi  level,  it  is  necessary  to  calculate  the  DOS  at 
over  a  hundred  energy  levels.  Each  of  these  DOS  can 
be  calculated  independent  of  the  other  energy  levels. 

Third, the availability of a parallel Gflop computer,
iPSC/860,  has  made  it  feasible  and  attractive  to  de¬ 
velop  an  efficient  parallel  version  of  the  KKR-CPA 
code. 

The  modifications  to  the  KKR-CPA  code  were 
made  in  such  a  way  that  the  code  could  still  be  run  on 
CRAYs  and  scientific  workstations.  Having  only  one 
consolidated  code  has  made  the  problems  of  software 
changes  and  data  structure  interfacing  much  simpler 
than trying to keep three versions of the code up-to-date.
Moreover, the user interface is identical across
all the machines the code runs on, which has been an
important factor in getting scientists interested in
executing this code in a parallel environment. The
parallelism is hidden from the user. Even operations like
getting a number of processors and loading programs
onto these processors are done automatically by the
code.  If  the  user  wishes  to  increase  the  parallelism, 
the  input  file  contains  the  number  of  processors  the 
computational  experiment  will  use. 

In  the  next  section,  we  describe  the  KKR-CPA  ap¬ 
proach  and  how  it  has  been  parallelized  for  the  In¬ 
tel  iPSC/860.  In  the  last  section,  we  present  perfor¬ 
mance  results  comparing  our  implementation  of  this 
algorithm  on  several  computers,  and  we  present  re¬ 
sults  from  two  scientific  studies  of  the  effect  of  alloy¬ 
ing  in  perovskite  superconductors. 

2  KKR-CPA  Algorithm 

Figure  1  shows  a  general  schematic  of  how  we  orga¬ 
nized  the  consolidated  KKR-CPA  code.  Organizing 
the  code  in  this  way  required  only  a  few  additional 
routines  to  be  written.  None  of  the  additional  rou¬ 
tines  involved  calculations,  so  exactly  the  same  com¬ 
putational  routines  are  called  in  the  serial  and  parallel 
versions. 

We opted to use a master/slave paradigm in our
parallel  implementation.  In  this  scheme  one  proces¬ 
sor  controls  work  on  the  entire  problem,  and  the  rest 
of  the  processors  perform  work  requested  by  this  mas¬ 
ter  process.  The  master  process  in  our  implementa¬ 
tion  is  called  the  pseudo-host  and  executes  on  one  of 
the  iPSC/860  nodes.  We  avoided  using  the  iPSC/860 
host  as  the  master  process  because  of  the  computa¬ 
tional  imbalance  between  this  80386  based  processor 
and  the  more  powerful  i860  based  node.  The  host  is 
also overburdened with executing the Unix operating
system. 

The KKR-CPA algorithm is organized in the following
way. We start by inputting the atomic numbers
of the species and an initial guess for the charge
density  and  potentials. 

Since  the  Green’s  function  for  the  system  at  any 
energy  is  independent  of  any  other  energy,  this  is  a 
natural  point  in  the  algorithm  for  parallelism.  In  the 
parallel  implementation,  the  energies  to  be  evaluated 
are  held  in  a  queue  of  tasks.  The  difficulty  of  each 
task  is  initially  unknown,  so  a  heuristic  is  used  to  or¬ 
der  the  queue  in  approximately  decreasing  difficulty. 
Each  idle  processor  selects  the  next  task  in  the  queue 
and returns the density of states to the master process,
which computes the integral over all energies.
This integrated density of states is used to obtain the
Fermi level, which is the highest state occupied by an
electron.
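
The effect of ordering the queue by decreasing difficulty can be seen in a small simulation. The Python sketch below models idle workers pulling the next task from a shared queue and reports the time at which the last worker finishes. The task costs are invented, and the heuristic is idealised as knowing each cost exactly; the ordering heuristic actually used in the code is not specified here.

```python
import heapq

def makespan(costs, n_workers):
    """Finish time when each idle worker takes the next task in queue order."""
    free_at = [0.0] * n_workers           # time at which each worker is idle
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)    # earliest idle worker takes the task
        heapq.heappush(free_at, start + cost)
    return max(free_at)

def order_by_estimate(costs, estimate=lambda c: c):
    """Queue order of approximately decreasing difficulty (idealised heuristic)."""
    return sorted(costs, key=estimate, reverse=True)

task_costs = [2, 3, 7, 1, 4, 8]           # hypothetical per-energy work units
unordered_finish = makespan(task_costs, 2)
ordered_finish = makespan(order_by_estimate(task_costs), 2)
```

With two workers, the unordered queue finishes at 16 time units, because the most expensive task is started last. The ordered queue finishes at 13, which is the best possible for a total of 25 units of work.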

Load  balancing  is  achieved  naturally  since  all  the 
processors  will  remain  busy  as  long  as  there  are  tasks 
left  in  the  queue.  Each  task  in  the  queue  performs 
the  following  operations. 

It  solves  the  one-electron  Schrodinger  equation  for 
a  single,  spherically  symmetric  muffin-tin  potential  to 
obtain  the  wave  functions  and  the  scattering  phase 
shift.  The  phase  shift  is  used  to  construct  the  single 
site  transfer  matrix  t,  which  depends  only  on  energy 
and  is  used  in  setting  up  the  KKR  matrix. 

The  systems  we  are  considering  are  periodic  in 
space,  so  we  work  in  reciprocal  space  by  applying  3D 
Fourier  transforms.  A  Wigner-Seitz  cell  in  reciprocal 
space  is  called  a  Brillouin  zone.  Since  everything  is 
periodic  in  reciprocal  space,  we  do  all  our  CPA  cal¬ 
culations  within  the  first  Brillouin  zone. 

The CPA iteration calculates the coherent single
site transfer matrix $t_c$ and the scattering path
operator $\tau$ for the disordered system. Initially, $t_c$ is
approximated by the average t matrix approximation, which
is the concentration-weighted average of the alloying
components. The next two steps of the CPA
iteration are the most computationally intensive of our
approach.  The  processor  must  form  the  KKR  matrix 
and  then  integrate  its  inverse  over  the  first  Brillouin 
zone.  If  there  is  symmetry  within  the  Brillouin  zone 
this  can  be  exploited  to  decrease  computation.  For 
example,  materials  with  cubic  symmetry  require  that 
only  1/48  of  the  Brillouin  zone  be  integrated. 
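
The starting guess described above, the concentration-weighted average of the component t matrices, is simple enough to write down directly. The Python sketch below is illustrative only: the function name, the list-of-rows matrix layout, and the 2 x 2 complex example values are assumptions, not quantities from the KKR-CPA code.

```python
def average_t_matrix(concentrations, t_matrices):
    """Concentration-weighted average of single-site t matrices.

    concentrations : fractions of each alloying component (must sum to 1)
    t_matrices     : one square complex matrix per component, as lists of rows
    """
    if abs(sum(concentrations) - 1.0) > 1e-12:
        raise ValueError("concentrations must sum to 1")
    size = len(t_matrices[0])
    return [[sum(c * t[i][j] for c, t in zip(concentrations, t_matrices))
             for j in range(size)]
            for i in range(size)]

# Hypothetical two-component alloy at equal concentration.
t_a = [[1 + 1j, 0], [0, 2]]
t_b = [[3 - 1j, 0], [0, 4]]
t_start = average_t_matrix([0.5, 0.5], [t_a, t_b])
```

The CPA iteration then refines this starting value; the sketch covers only the initial step.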

To form the KKR matrix ($t_c^{-1} - G$), it first
calculates the “structure constants” matrix G. In general,
the calculation of G is very difficult, but this
algorithm has been made more efficient by using a special
polynomial fitting technique to evaluate G. A
description of this method of calculating structure
constants can be found in [8].

One problem in inverting the KKR matrix is that it will
be singular in certain regions, and it is these
singularities that determine the energy bands. The KKR
method can be analytically continued into the complex
energy plane. By performing these calculations
in the complex energy plane, these singularities obtain
a Lorentzian broadening, and the amount of broadening
is proportional to the imaginary part of the
energy. In addition, due to the sensitivity of the
calculation, double precision complex arithmetic is used.
To evaluate the integral, hundreds or possibly thousands
of complex double precision matrices of order
between 80 and 300 must be formed and inverted.
Each matrix corresponds to a different vertex of the
tetrahedrons into which the Brillouin zone has been
subdivided. The result of the tetrahedral integration
is the scattering path operator $\tau$.
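
The Lorentzian broadening can be seen in the simplest possible model, a single isolated level whose Green's function is $1/(z - E_0)$. The Python sketch below evaluates the corresponding density of states at a complex energy $z = E + i\eta$. This toy model, and the names `broadened_dos`, `eta`, and `level`, are assumptions for illustration and do not involve the KKR matrix itself.

```python
import math

def broadened_dos(energy, eta, level):
    """Density of states of one level at complex energy z = energy + i*eta.

    Equals -(1/pi) * Im[1 / (z - level)], which is a Lorentzian centred on
    `level` with half-width at half-maximum `eta`.
    """
    green = 1.0 / (complex(energy, eta) - level)
    return -green.imag / math.pi

def lorentzian(energy, eta, level):
    """Closed form of the same quantity, for comparison."""
    return (eta / math.pi) / ((energy - level) ** 2 + eta ** 2)

peak = broadened_dos(0.3, 0.05, 0.3)          # value at the level itself
half = broadened_dos(0.35, 0.05, 0.3)         # value one half-width away
```

On the real axis the pole would be a singularity; a small imaginary part turns it into a peak of height $1/(\pi\eta)$ whose width grows in proportion to $\eta$.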

This $\tau$ is inserted into the Coherent Potential
Approximation (CPA) equations to calculate the next
approximation to $t_c$.

Once $\tau$ and $t_c$ have converged, the Green’s function
for the system is calculated by combining $\tau$ and
the  wave  function  solutions  to  the  single  scatterer 
Schrodinger  equation.  The  DOS  for  this  energy  is 
the  imaginary  part  of  the  integration  of  the  Green’s 
function  over  the  Wigner-Seitz  cell. 

Self-consistency  of  the  charge  density  will  soon  be 
incorporated  into  the  KKR-CPA  code.  This  outer  it¬ 
eration  involves  integrating  the  Green’s  function  over 
energy  to  get  the  charge  density,  which  is  used  to 
obtain  the  potential  for  the  next  iteration.  Thus  the 
entire  process  described  so  far  may  be  iterated  several 
times  in  the  self-consistent  version  of  the  code.  In  the 
parallel  implementation  this  will  involve  the  pseudo¬ 
host  integrating  the  density  of  states  it  receives  from 
the  nodes  over  all  the  energies.  A  future  paper  will 
describe  this  work. 
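
The structure of such an outer iteration can be sketched generically. In the Python example below the two callables stand in for the energy integration that yields a charge density and for the construction of a new potential from that density. The damped (linearly mixed) update, the scalar "potential", and the toy model are assumptions made for illustration, since the self-consistent version of the code is not described in detail here.

```python
import math

def self_consistent_loop(potential, density_from, potential_from,
                         mixing=0.5, tol=1e-10, max_iter=200):
    """Iterate potential -> density -> potential until the change is below tol.

    density_from   : maps a potential to a charge density (placeholder for
                     integrating the Green's function over energy)
    potential_from : maps a charge density to a new potential
    mixing         : fraction of the new potential mixed in each cycle
                     (a damping choice assumed here)
    Returns (converged potential, number of cycles used).
    """
    for cycle in range(1, max_iter + 1):
        density = density_from(potential)
        step = potential_from(density) - potential
        potential += mixing * step
        if abs(step) < tol:
            return potential, cycle
    raise RuntimeError("self-consistency not reached")

# Toy scalar model whose fixed point satisfies v = cos(v).
v_final, cycles = self_consistent_loop(1.0, math.cos, lambda rho: rho)
```

In the toy model the loop settles on the solution of $v = \cos v$ within a few tens of cycles; in the parallel code each cycle would contain the entire queue of energy tasks described earlier.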

3  Results 

The  code  has  been  written  so  that  it  executes  on  se¬ 
rial  computers  such  as  workstations  or  CRAYs  as  well 
as  on  parallel  computers  such  as  the  Intel  iPSC/860. 
The  code  requires  that  a  minimum  of  3  Mbytes  of 
memory  be  available,  and  for  the  more  complicated 
materials  up  to  8  Mbytes  of  memory  is  required  by  in¬ 
dividual  processors.  Work  is  underway  to  reduce  the 
memory  requirements  for  the  complicated  materials. 

The  Intel  iPSC/860  multiprocessor  at  ORNL  has 
128  RX  nodes  and  4  I/O  nodes.  The  I/O  nodes  con¬ 
nect  the  RX  nodes  to  the  Concurrent  File  System 
(CFS),  which  has  5.2  Gbytes  of  disk  storage.  Each 
of the RX nodes contains a 40 MHz Intel i860 RISC
processor  and  8  Mbytes  of  memory.  The  i860  has  a 
2  Kbyte  on-chip  cache  and  a  claimed  peak  rate  of  80 
Mflops  (single  precision).  While  our  hand  coded  as¬ 
sembly  language  BLAS  routines  execute  on  one  pro¬ 
cessor  at  19  -  55  Mflops,  these  rates  are  not  obtained 
inside  an  application  because  of  memory  access  de¬ 
lays.  For  example,  in  the  KKR-CPA  code  we  use  the 
BLAS  routine  ZAXPY,  which  executes  at  approxi¬ 
mately  18  Mflops  inside  the  application. 

The high Tc perovskite superconductors Ba(1-x)K(x)BiO3 and
BaPb(1-x)Bi(x)O3, with critical temperatures of 30 K and 13 K
respectively, are attractive systems on which to begin a
systematic study of high temperature superconductivity because
their relatively simple (cubic) structure allows a more thorough
treatment of their electronic structure. It is believed that the
superconducting state of these materials can be understood by
studying their electronic structure in their normal state. Even
if the mechanism for superconductivity is different in the
non-cuprate superconductors, because of their high transition
temperature the electron-phonon coupling constant would have to
be extremely large, and a strong coupling of this magnitude would
be a very interesting phenomenon.

The code has been run successfully on several computers using a
test problem involving the high temperature superconductor
(Ba0.5K0.5)BiO3. The test problem required the calculation of the
density of states for a fixed number of representative energies
without iterating to self-consistency. The average Mflop rate for
5 computers is shown in Figure 2.

Only  one  processor  on  the  CRAY  2  and  CRAY 
YMP  is  used.  The  130  Mflops  shown  in  the  table 
is achieved by modifying several routines in the basic
code  to  further  enhance  vectorization.  The  inversion 
routines  had  already  been  vectorized,  but  the  routines 
to  form  the  KKR  matrix  had  not  been  vectorized  in 
the  basic  code. 

The  rate  shown  for  the  iPSC/860  includes  the  time 
to  load  the  problem  onto  128  processors,  all  commu¬ 
nication,  file  I/O  (four  fairly  large  output  files  are 
generated),  and  dynamic  load  balancing  overhead. 
The rate of 660 Mflops corresponds to compiled FORTRAN on a
machine running at 32 MHz. This rate was increased to 1.3 Gflops
by using an assembly language BLAS routine ZAXPY in the inversion
routine. When the iPSC/860 was upgraded to 40 MHz, the ZAXPY
version of the superconductor code executed at an aggregate rate
of 1.6 Gflops on 128 processors.
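As a rough consistency check on these figures, the short sketch below scales the 32 MHz rate to 40 MHz and converts the aggregate rate to a per-node average. It assumes that throughput scales linearly with clock rate, which the text does not state; the function names are illustrative only.

```python
# Back-of-the-envelope check of the iPSC/860 rates quoted above.
# Assumption (not from the text): throughput scales linearly with clock rate.

def scale_rate(rate_gflops: float, old_mhz: float, new_mhz: float) -> float:
    """Scale a measured rate by the ratio of clock frequencies."""
    return rate_gflops * new_mhz / old_mhz

def per_node_mflops(aggregate_gflops: float, nodes: int) -> float:
    """Average per-node rate in Mflops for a given aggregate rate."""
    return aggregate_gflops * 1000.0 / nodes

predicted = scale_rate(1.3, 32.0, 40.0)   # 1.3 Gflops at 32 MHz, moved to 40 MHz
per_node = per_node_mflops(1.6, 128)      # reported 1.6 Gflops on 128 nodes
print(f"predicted aggregate: {predicted:.3f} Gflops")
print(f"per-node average:    {per_node:.1f} Mflops")
```

Under that assumption the predicted aggregate is about 1.6 Gflops, in line with the reported figure. The per-node average of 12.5 Mflops sits below the roughly 18 Mflops quoted for ZAXPY inside the application, which is plausible given that the aggregate rate includes loading, communication, file I/O, and load balancing overhead.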

The calculation of electronic states of alloys over a large
energy spectrum is not feasible on most of the computers listed
in Figure 2. But these calculations have been performed on the
Intel iPSC/860. The first research question we asked was: What
are the effects of alloying on the density of states for the two
perovskite superconducting compounds Ba(1-x)K(x)BiO3 and
BaPb(1-x)Bi(x)O3?

The rigid-band approximation has been used previously to study
the effects of disorder in the perovskite superconductors [9,10].
In the rigid-band approximation it is assumed that the difference
in the phase shifts of the alloying components is negligible and
that, therefore, the effect of alloying is to rigidly shift the
Fermi energy up or down depending on whether the alloying
component's atomic number is greater than or less than the
original component's atomic number. The rigid-band approximation
is valid only in the weak scattering limit. It is the purpose of
these calculations to test the appropriateness of this
assumption.

To study the effects of disorder, the density of states (DOS) of
the ordered materials (x=0 and x=1) is calculated and compared to
that of the disordered alloy (x=0.5). At the top and bottom of
Figure 3 are the DOS of the ordered compounds BaBiO3 and KBiO3
respectively, and the DOS of the disordered alloy Ba0.5K0.5BiO3
is in the middle. The DOS of these materials near the Fermi
energy (E_F = 0.0 Ry) is dominated by Bi-O states. Comparing
these states we can see that the Bi-O states in the alloy have
been slightly broadened by the disorder on the Ba-K sublattice.
The broadening is small because Ba and K are on a different
sublattice, and this is therefore a second order effect. We also
found that the variation of E_F versus concentration in the CPA
agrees with the rigid-band approximation. Therefore, because of
the small broadening of the DOS and the agreement of the
variation of E_F with concentration, we conclude that the use of
the rigid-band approximation is valid for this material.

Similarly, Figure 4 displays the results of alloying with lead,
but here the alloying is on the Bi sublattice rather than on the
Ba sublattice. At -0.60 Ry and -0.50 Ry in the alloy are the
Bi-6s and Pb-6s states respectively, and these show significant
disorder. But these states are far from E_F and are unimportant.
The states near E_F are Bi-O and Pb-O, and these show almost no
broadening. Even though the DOS show very little disorder, the
variation of E_F with concentration does not satisfy the
rigid-band approximation, and this is the most stringent
criterion that must be satisfied. Therefore, we conclude that the
rigid-band approximation, used previously by other authors to
study the effects of alloying, is not valid for this system.

The iPSC/860 required about one hour to generate the data used in
each of these figures. The results show that the
superconductivity is affected in different ways by each of these
alloys. Alloying with potassium leaves the band structure
essentially unchanged but decreases the Fermi energy. On the
other hand, alloying with lead causes a softening of the band
structure.

The  use  of  parallel  computation  and  the  iPSC/860 
has  led  to  over  an  order  of  magnitude  improvement 
in  computational  speed  compared  to  the  CRAY  su¬ 
percomputers  for  our  KKR-CPA  code.  From  a  re¬ 
search  standpoint  the  turnaround  time  for  computa¬ 
tional  experiments  is  closer  to  two  orders  of  magni¬ 
tude.  This  greater  computational  power  allows  us  to 
begin  investigation  of  many  unanswered  questions  in 
superconductivity  and  material  science. 
Figure 1: Schematic of parallel implementation of KKR-CPA code.


Figure 2: Performance in Mflops of KKR-CPA code on various
computers: (a) extra vectorization employed, (b) FORTRAN only,
(c) using assembly ZAXPY.
Figure 3: Effects of alloying on the density of states for Ba(1-x)K(x)BiO3.

Figure 4: Effects of alloying on the density of states for BaBi(1-x)Pb(x)O3.

Abstract

The  central  problem  of  single  crystal  molecular  structure 
determination  via  X-ray  diffraction  is  the  “phase  prob¬ 
lem.”  Associated  with  each  diffraction  maximum  (a  re¬ 
flection)  are  a  magnitude,  which  can  be  experimentally  de¬ 
termined,  and  a  phase  angle,  which  is  lost  in  the  experi¬ 
ment.  The  goal  of  “direct  methods”  is  to  mathematically 
reconstruct  the  phase  information  from  the  magnitude  in¬ 
formation  alone. 

Traditional  direct  methods  are  capable  of  determining 
structures  of  moderate  complexity,  but  to  extend  them  to 
problems  of  the  size  of  macromolecules  (proteins,  etc.)  re¬ 
quires  developing  new  techniques  that  appear  to  be  com¬ 
putationally  intensive.  Recently,  a  new  formulation  of  the 
phasing  process,  dependent  on  a  minimal  function,  has 
been  proposed.  Here  we  explore  a  number  of  different 
implementations  of  the  principle  to  the  solution  of  small 
molecular structures. The machines that we use include an Intel
iPSC/2 hypercube, the Connection Machine CM-2, and a network of
Sun workstations.

1  Introduction 

A  mainstay  of  modern  structural  chemistry  is  the  sin¬ 
gle  crystal  X-ray  diffraction  technique  of  structure  de¬ 
termination.  This  technique  provides  a  three  dimen¬ 
sional  mapping  of  the  positions  of  atoms  in  crystals, 
thereby  securing  unambiguous  information  about  the 
architecture  of  molecules.  The  technique  is  robust  in 
the  sense  that  solids  as  diverse  as  silicon  and  virus 
crystals  can  be,  and  are,  subjects  fit  for  study.  The 
three  stages  of  an  X-ray  diffraction  experiment  are 

1. the growth of suitable single crystals of the substance to be
studied,

2. the measurement of X-ray diffraction data, and

3. the unraveling of the molecular structure that agrees with the
diffraction data.

The  last  step  is  frequently  computationally  intensive. 

In the experiment, a beam of X-rays of well-defined wavelength,
say 1 Angstrom, is trained on the crystal. The crystal is
oriented so that an individual diffracting plane is brought into
the Bragg condition, and diffracted photons are either counted
electronically or recorded photographically. The process is
repeated anywhere from a few hundred to a few million
times,  depending  on  the  size  of  the  structure  to  be  de¬ 
termined,  as  individual  diffracting  planes  are  brought 
into  the  Bragg  condition.  Each  condition  for  diffrac¬ 
tion,  called  a  reflection,  is  characterized  by  a  location 
on  a  three-dimensional  grid,  or  reciprocal  lattice,  cor¬ 
responding  to  the  orientation  of  the  crystal  and  the 
angle  which  the  diffracting  plane  makes  with  the  in¬ 
coming  X-ray  beam.  As  the  grid  constitutes  a  true 
lattice,  each  reflection  can  be  labeled  by  three  inte¬ 
gers,  the  Miller  indices,  that  denote  the  location  of 
the  reflection  on  the  reciprocal  lattice  relative  to  a 
common  origin.  Each  reflection  is  additionally  char¬ 
acterized  by  a  diffraction  intensity  related  simply  to 
the  number  of  counts  recorded  electronically  or  to  the 
blackness  of  film  recorded  photographically.  The  in¬ 
tensity  is  related  to  the  efficiency  with  which  a  Bragg 
plane  diffracts  X-rays.  As  electrons  are  the  media  that 
diffract  X-rays,  and  atoms  are  made  up  of  electrons 
centered  about  their  nuclei,  the  intensity  of  an  indi¬ 
vidual  reflection  is  related  to  the  density  of  electrons 
in  the  near  vicinity  of  the  Bragg  plane.  The  mathemat¬ 
ics  that  relates  the  underlying  atomic  arrangement  in 
a  crystal  to  the  intensities  and  locations  of  the  Bragg 
reflections  is  a  three-dimensional  Fourier  transforma¬ 
tion. 

It  would  seem  that  all  the  tools  necessary  to  unravel 
the  structure  of  molecules  in  crystals  are  assembled 
once  the  diffraction  experiment  is  concluded.  Nature, 
however,  has  its  own  agenda.  Missing,  and  presumably 
lost,  in  the  experiment  are  the  phases  of  the  Fourier 
coefficients  relative  to  a  common  reciprocal  lattice  ori¬ 
gin.  That  is,  the  experiment  yields  the  amplitudes  and 
orientations  of  the  Fourier  components  but  not  their 
phases.  What  nature  conceals,  the  direct  methods  of 
structure  determination  seek  to  supply. 

2  The  Phase  Problem 

It is the determination of the set of phases, one for each
reflection, that constitutes the phase problem. Early analyses of
the problem led some to believe that the problem was in principle
unsolvable. An infinity of Fourier transformation maps could be
had that fit the experimental results; they would differ only in
the set of phases used to reconstruct the atomic arrangement. On
the other hand, common sense held that since a small number of
structural arrangements had been ascertained by a trial and error
method there must be a solution to the phase problem. Two
physical constraints make the problem not only solvable but in
principle greatly overdetermined. One is the hard constraint that
for a Fourier transformation to be physically meaningful it must
lead to a map in which the calculated electron density (electrons
per cubic Angstrom) is everywhere non-negative. The other is a
softer constraint that electron density about atoms in molecules
(whether in crystals or in the gas phase) is strongly
concentrated about the atomic centers (the nuclei).
“Non-negativity” and “atomicity” were two important principles in
the earliest formulations of direct methods.

In a direct methods attack on the phase problem, probabilistic
theories are used to relate the phases, or more precisely certain
linear relationships among the phases, to the measured intensity
data. For example, it can be shown that the sum of three phases

$$\phi_H + \phi_K + \phi_{-H-K},$$

where H and K are reciprocal vectors with distinct Miller
indices, e.g., H = {3,1,2}, K = {-4,3,-8}, and -H-K = {1,-4,6},
has a most probable value of 0 mod $2\pi$ radians, and that
probability increases as the product of the magnitudes of the
intensities of the reflections {3,1,2}, {-4,3,-8} and {1,-4,6}
increases. Such a relationship among three phases is called, in
the trade, a “triple” relationship. Analogously, a “quartet” of
the form

$$\phi_L + \phi_M + \phi_N + \phi_{-L-M-N},$$

where the main terms L, M, N, and -L-M-N are associated with
large intensities and the cross-terms L+M, M+N, and N+L are
associated with small intensities, has a most probable value of
$\pi$ mod $2\pi$, and that probability increases as the main
terms become larger and/or the cross-terms become smaller. With
these tools in hand a number of strategies evolved to “solve the
phase problem” for small to moderate sized (up to 315 atoms at
last count) structures. It is to the extension of these methods
to larger structures that we direct our attention.
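The index arithmetic behind a triple can be checked mechanically. The sketch below uses the example indices from the text; the helper names `third_index` and `closes_to_zero` are illustrative and not taken from the authors' program.

```python
def third_index(h, k):
    """Return -H - K, the Miller index that completes a triple with H and K."""
    return tuple(-(a + b) for a, b in zip(h, k))

def closes_to_zero(*indices):
    """True if the given Miller-index vectors sum to (0, 0, 0)."""
    return tuple(sum(c) for c in zip(*indices)) == (0, 0, 0)

H = (3, 1, 2)
K = (-4, 3, -8)
L = third_index(H, K)
print(L)                        # (1, -4, 6)
print(closes_to_zero(H, K, L))  # True
```

The same closure test applies to a quartet by passing four indices, for example L, M, N and -L-M-N.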

3  The  Minimal  Function 

As  structures  become  larger,  estimates  for  the  phase 
sums (the “triples” and “quartets”) become increasingly
less reliable. Consequently, a direct attack that
is  promising  for  small  structures  becomes  untenable 
for  large  structures.  On  the  other  hand,  whereas  the 
number  of  reflections  grows  more  or  less  linearly  with 
the  size  of  the  structures,  the  number  of  phase  sums 
explodes  catastrophically.  A  global  procedure  to  make 
the  sheer  numbers  of  phase  sums  work  for  the  crystal- 
lographer  exploits  the  probabilistic  nature  of  the  esti¬ 
mates  of  the  phase  sums  in  the  following  way.  As  the 
structure  gets  ever  larger  the  estimate  for  any  individ¬ 
ual  triple  or  quartet  becomes  less  and  less  reliable,  but 
averaged  over  the  ever  increasing  number  of  such  phase 
relationships,  the  estimates  as  a  whole  get  better.  A 
breakthrough  in  the  use  of  the  estimates  came  when  it 
was  realized  that  a  particularly  simple  function  of  the 
phases,  defined  below,  is  a  minimum  when  the  correct 
set  of  phases  is  used  to  compute  the  function.  It  was 
quickly  realized  that  the  conceptual  problem  of  phas¬ 
ing  procedure  was  replaced  by  one  of  computational 
strategy. 


3.1  Theory 

We assume a crystal structure S in the space group G to be fixed,
but unknown. The normalized structure factor magnitudes |E| are
also assumed to be known. The function to be minimized is defined
initially as a function, R(I), of the structure invariants:

$$R = \frac{1}{D}\left[\sum_{H,K} A_{HK}\left(\cos T_{HK} - \frac{I_1(A_{HK})}{I_0(A_{HK})}\right)^2 + \sum_{L,M,N} |B_{LMN}|\left(\cos Q_{LMN} - \frac{I_1(B_{LMN})}{I_0(B_{LMN})}\right)^2\right] \qquad (1)$$

where

$$D = \sum_{H,K} A_{HK} + \sum_{L,M,N} |B_{LMN}|.$$

It should be noted that $B_{LMN}$ takes on negative values (see
below) when the cross-terms are very small, so it sums into the
denominator D as its absolute value.

$$T_{HK} = \phi_H + \phi_K + \phi_{-H-K} \qquad (2)$$

is the triplet,

$$Q_{LMN} = \phi_L + \phi_M + \phi_N + \phi_{-L-M-N} \qquad (3)$$

is the quartet,

$$A_{HK} = \frac{2}{N^{1/2}}\,|E_H E_K E_{H+K}| \qquad (4)$$

$$B_{LMN} = \frac{2}{N}\,|E_L E_M E_N E_{L+M+N}|\left(|E_{L+M}|^2 + |E_{M+N}|^2 + |E_{N+L}|^2 - 2\right), \qquad (5)$$

N is the number of atoms, assumed identical, in the unit cell,
and $I_1$ and $I_0$ are the modified Bessel functions. In view of
Eqs. (2) and (3), Eq. (1) also defines R as a function,
$R(\phi)$, of the phases $\phi$. Since the magnitudes |E| are
assumed to be known, the functions R(I) and $R(\phi)$ are known.
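To make the ingredients of Eq. (1) concrete, the following sketch evaluates the triplet and quartet weights of Eqs. (4) and (5) and the Bessel ratio $I_1/I_0$ that gives the expected cosine. It is an illustration only: the function names are hypothetical, the Bessel functions are computed from their power series (adequate for moderate arguments, not for very large ones), and the |E| values are invented.

```python
import math

def bessel_i(n: int, x: float, terms: int = 60) -> float:
    """Modified Bessel function I_n(x) from its power series (moderate |x| only)."""
    half = x / 2.0
    return sum(half ** (2 * k + n) / (math.factorial(k) * math.factorial(k + n))
               for k in range(terms))

def a_hk(n_atoms, e_h, e_k, e_hk):
    """Eq. (4): A_HK = 2 N^(-1/2) |E_H E_K E_(H+K)|."""
    return 2.0 / math.sqrt(n_atoms) * abs(e_h * e_k * e_hk)

def b_lmn(n_atoms, e_l, e_m, e_n, e_lmn, e_lm, e_mn, e_nl):
    """Eq. (5): main-term product times (sum of squared cross-terms minus 2)."""
    main = abs(e_l * e_m * e_n * e_lmn)
    cross = e_lm ** 2 + e_mn ** 2 + e_nl ** 2 - 2.0
    return 2.0 / n_atoms * main * cross

def expected_cosine(arg):
    """I_1(arg) / I_0(arg), the expected cosine of an invariant."""
    return bessel_i(1, arg) / bessel_i(0, arg)

# Invented magnitudes for a 100-atom cell: large main terms, small cross-terms.
A = a_hk(100, 2.0, 2.0, 2.0)
B = b_lmn(100, 2.0, 2.0, 2.0, 2.0, 0.1, 0.1, 0.1)
print(A, expected_cosine(A))   # positive weight, expected cosine between 0 and 1
print(B, expected_cosine(B))   # negative B, negative expected cosine
```

Because $I_1$ is odd and $I_0$ is even, a negative $B_{LMN}$ yields a negative expected cosine, which matches the statement that quartets with small cross-terms tend toward $\pi$.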

Next, the phases $\phi$ are themselves functions, for fixed
choice of origin, of structures T,

$$E_H = |E_H| \exp(i\phi_H) = \frac{1}{N^{1/2}} \sum_{j=1}^{N} \exp(2\pi i\, H \cdot r_j) \qquad (6)$$

where $r_j$ is the position vector of the atom labeled j. Since
the structure invariants $T_{HK}$ and $Q_{LMN}$ are uniquely
determined by the structure T, independently of the choice of
origin, it follows that Eq. (1) also defines a function, R(T), of
structures T. Furthermore, since the magnitude of any structure
invariant is the same for T and its enantiomorph, but has
opposite signs for the enantiomorphs, and since only the cosines
of the structure invariants appear in Eq. (1), it follows that R
has the same value for T and its enantiomorph.

The minimal principle states that

$$R(S) < R(T) \quad \text{if } T \neq S. \qquad (7)$$

Since, in general, the number of phases exceeds by far the number
of independent atomic coordinates needed to fix the crystal
structure, a large number of identities must be satisfied by the
phases. Thus the phases are not independent variables and the
minimum of $R(\phi)$, regarded now as a function of independent
phases, will in general be less than R(S) but will yield values
of the phases somewhat different from the true phases
corresponding to the structure S. What is needed then is the
global minimum of $R(\phi)$ subject to the constraint that all
identities among the phases, which must of necessity be
fulfilled, are in fact satisfied.

One may go back to Eq. (1) and observe that, since the number of
structure invariants exceeds by far the number of phases, and
since the phases themselves are not independent, a large number
of identities among the structure invariants $T_{HK}$ and
$Q_{LMN}$ must also be satisfied. Thus the minimal principle may
be reformulated as follows: Among all structure invariants
$T_{HK}$ and $Q_{LMN}$ which satisfy the necessary identities,
those invariants which correspond to the structure S minimize the
function R(T). Alternatively, among all phases $\phi$ which
satisfy the necessary identities, those corresponding to the true
structure S minimize $R(\phi)$; or finally, the (N atom)
structure T which minimizes R(T) coincides with S.

4  Computational  Initialization 

4.1  Invariant  Generation 

The triplets and quartets which serve as input to our program are
defined as follows. Suppose we are given n sets of Miller
indices, $M_1 \ldots M_n$, where each set consists of three
integers (x, y, z) which refer to the location of the reflection
on the reciprocal lattice relative to a common origin. Associated
with each Miller index $M_i$, $1 \le i \le n$, is a diffraction
intensity $|E_i|$. Let a triplet t = (h, k, l) refer to the
h-th, k-th and l-th sets of Miller indices such that
$M_h + M_k + M_l = 0$. Similarly, let a quartet be defined as
Q = (h, k, l, m), where $M_h + M_k + M_l + M_m = 0$.

For the molecular structure that we are currently working with,
we generate the triplets and quartets as follows. Sort the sets
of Miller indices that were experimentally determined into
decreasing order by their intensity values and select the top n
sets, where n refers to the number of phases to be determined.
For each Miller index $M_h$, consider $M_k = (x, y, z)$, where
$1 \le h \le k \le n$, for all permutations of
$(\pm x, \pm y, \pm z)$. Determine if there exists a Miller index
$M_l$, $k \le l \le n$, such that $M_l$, or any permutation of
$M_l$, is equal to $-M_h - M_k$. Such an $M_l$ may not exist if
its associated intensity value, $|E_l|$, is too small. If such an
$M_l$ does exist then the triplet t = (h, k, l) is formed. See
Figure 1. A similar process is followed for the quartets.
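The search just described can be sketched in a few lines. This is a simplified illustration rather than the authors' generator: it assumes the input list already contains every index to be considered (so the sign permutations and symmetry equivalents mentioned above are not expanded), it uses zero-based positions, and the function name and sample reflections are hypothetical.

```python
def generate_triplets(miller):
    """Return position triples (h, k, l), h <= k <= l, with M_h + M_k + M_l = 0.

    `miller` is a list of (x, y, z) tuples, assumed already sorted by
    decreasing |E| and truncated to the top n reflections.
    """
    position = {m: i for i, m in enumerate(miller)}   # look up a position by index value
    triplets = []
    for h, mh in enumerate(miller):
        for k in range(h, len(miller)):
            mk = miller[k]
            needed = tuple(-(a + b) for a, b in zip(mh, mk))  # -M_h - M_k
            l = position.get(needed)
            if l is not None and l >= k:   # l >= k keeps each triple from being listed twice
                triplets.append((h, k, l))
    return triplets

reflections = [(3, 1, 2), (-4, 3, -8), (1, -4, 6), (2, 0, 0), (-5, -1, -2)]
triplets = generate_triplets(reflections)
print(triplets)   # [(0, 1, 2), (0, 3, 4)]
```

The dictionary lookup replaces a third nested loop, so the cost grows with the square of n rather than its cube. Quartets could be generated the same way with one more loop level.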

4.2  Computing  R 

Using the above definitions of triplets and quartets we can
define $\phi_t$ as

$$\phi_t = \phi_{hkl} = \phi_h + \phi_k + \phi_l,$$

where t is a triplet equal to (h, k, l). Using this notation the
calculation of R can be described as follows:

$$R = \frac{\sum_T W_t (\cos\phi_t - I_t)^2 + \sum_Q W_Q (\cos\phi_Q - I_Q)^2}{\sum_T W_t + \sum_Q W_Q}$$
Figure 1: Example Formation of a Triplet


where

$$I_t = \frac{I_1(A_{HK})}{I_0(A_{HK})}, \qquad I_Q = \frac{I_1(B_{LMN})}{I_0(B_{LMN})},$$

$$W_t = A_{HK}, \qquad W_Q = |B_{LMN}|.$$

I is the (known) expectation value of the cosine of the
corresponding structure invariant averaged over the conditional
probability distribution of triplets, W is a weight factor
inversely proportional to the variance, A and B are as defined
previously, and T and Q represent the set of triplets and
quartets, respectively.

Notice that each term of $\sum_T$ and $\sum_Q$ can be computed
independently. Thus, R can be efficiently computed in parallel by
computing partial sums and then combining them with a global sum
operation. The denominator, $\sum_T W_t + \sum_Q W_Q$, is a
constant which is computed during the initialization of the data
set. This initialization, which is performed once per data set,
also calculates the constants $W_t$, $W_Q$, $I_t$ and $I_Q$.
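The partial-sum structure can be sketched serially as follows. This is an illustration under stated assumptions, not the authors' hypercube code: each "worker" is simulated by a list slice, the global sum is an ordinary Python `sum`, the denominator is recomputed on every call instead of being cached at initialization, and the weights and expected cosines in the example are invented.

```python
import math

def partial_sums(invariants, phases):
    """Numerator and denominator contributions from one worker's share.

    Each invariant is (indices, weight, expected_cos): the positions of the
    phases it sums, its weight W, and its expected cosine I.
    """
    num = den = 0.0
    for indices, weight, expected_cos in invariants:
        phase_sum = sum(phases[i] for i in indices)
        num += weight * (math.cos(phase_sum) - expected_cos) ** 2
        den += weight
    return num, den

def minimal_function(invariants, phases, workers=4):
    """Evaluate R by splitting the invariants round-robin among `workers`."""
    chunks = [invariants[w::workers] for w in range(workers)]
    parts = [partial_sums(chunk, phases) for chunk in chunks]  # one result per worker
    num = sum(p[0] for p in parts)   # stands in for the global sum operation
    den = sum(p[1] for p in parts)
    return num / den

phases = [0.0, 0.0, 0.0, math.pi]
invariants = [
    ((0, 1, 2), 2.0, 1.0),       # a triplet with invented W_t = 2, I_t = 1
    ((0, 1, 2, 3), 1.0, -1.0),   # a quartet with invented W_Q = 1, I_Q = -1
]
r_exact = minimal_function(invariants, phases)
r_single = minimal_function([((0, 1, 2), 2.0, 0.5)], phases)
print(r_exact, r_single)
```

On a distributed memory machine each node would hold only its own chunk, and only the two scalar partial sums would cross the network.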

5  Simulated  Annealing 

Our  initial  search  for  the  global  minimum  focused  on 
the use of simulated annealing [3]. Simulated annealing
is  a  probabilistic  optimization  technique  designed  to 
escape  local  minima  in  search  of  the  global  minimum. 
Since  preliminary  studies  indicated  that  the  minimal 
function  contained  a  vast  number  of  local  minima,  the 
use  of  simulated  annealing  seemed  well  suited  to  our 
needs. 

Given an optimization function $f(x)$, where $x$ is a
configuration of the optimization problem, simulated annealing
begins by first choosing a random configuration $C_0$ of the
optimization problem, and an initial cooling parameter $c$. At
iteration $i$ of the algorithm, configuration $C_i$ is perturbed
to produce configuration $C_{i+1}$ such that $C_{i+1}$ lies
within the neighborhood of $C_i$. Let
$\Delta = f(C_{i+1}) - f(C_i)$. If $\Delta < 0$ or
$e^{-\Delta/c} > \mathrm{random}(0,1)$, then configuration
$C_{i+1}$ is accepted as the next configuration; otherwise
configuration $C_i$ is used as the next configuration. Notice
that in order to allow the optimization function to escape a
local minimum, configuration $C_{i+1}$ may be accepted even
though it has a higher cost than $C_i$. This process is continued
for a number of iterations (the length of the Markov chain)
before $c$ is decremented by multiplying it by a parameter
$\alpha$, where $\alpha < 1$. Thus, as the algorithm progresses
it becomes more difficult to climb out of a minimum. Hopefully,
this allows the function to eventually settle into the global
minimum.

Procedure Simulated_Annealing;
{
    Initialize(c, present_config);
    for (cooling_step = 1 to Number_Cooling_Steps)
    {
        for (chain = 1 to Markov_Chain_Length)
        {
            Perturb(present_config, new_config);
            Delta = Cost(new_config) - Cost(present_config);
            if (Delta < 0)
                present_config = new_config;    /* accept */
            else if (exp(-Delta/c) > random())
                present_config = new_config;    /* accept */
            else;                               /* reject new_config */
        }
        c = c * alpha;
    }
}

Figure 2: Simulated Annealing Algorithm

As can be seen from Figure 2, the process of simulated annealing
involves the following parameters: the cooling rate ($\alpha$),
the Markov chain length, the number of cooling steps, and the
amount of perturbation.
These  parameters  form  what  is  commonly  called  a  per¬ 
turbation  scheme  [1]. 
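For readers who want to experiment, the loop of Figure 2 can be written as a short runnable sketch. It follows the figure's structure, but the default parameter values, the quadratic test cost, and the uniform perturbation are assumptions chosen for illustration; they are not the settings used in this work.

```python
import math
import random

def accept(delta, c, u):
    """Acceptance rule of Figure 2: always accept downhill moves, and accept
    uphill moves when exp(-delta/c) exceeds the random draw u."""
    return delta < 0 or math.exp(-delta / c) > u

def simulated_annealing(cost, perturb, config, c=1.0, alpha=0.9,
                        cooling_steps=50, chain_length=100, rng=None):
    """Anneal `config`; returns the final configuration and its cost."""
    rng = rng or random.Random(0)
    current, current_cost = config, cost(config)
    for _ in range(cooling_steps):
        for _ in range(chain_length):            # one Markov chain at fixed c
            candidate = perturb(current, rng)
            candidate_cost = cost(candidate)
            if accept(candidate_cost - current_cost, c, rng.random()):
                current, current_cost = candidate, candidate_cost
        c *= alpha                                # cool after each chain
    return current, current_cost

# Toy problem (hypothetical): minimize x^2 starting far from the minimum.
final, final_cost = simulated_annealing(
    cost=lambda x: x * x,
    perturb=lambda x, rng: x + rng.uniform(-0.5, 0.5),
    config=10.0,
)
print(final, final_cost)
```

Each of the four parameters named in the text appears as an argument: `alpha`, `chain_length`, `cooling_steps`, and the perturbation size hidden inside `perturb`.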

Our  first  attempt  to  minimize  R  focused  on  ex¬ 
ploring  the  reciprocal  space,  which  is  commonly  called 
phase  space.  A  perturbation  scheme  requires  us  to  de¬ 
termine  the  following: 

•  The  number  of  phases  to  be  perturbed  at  each 
iteration. 

•  The  amount  to  perturb  each  phase. 


•  The  length  of  the  Markov  Chain. 

•  The  number  of  cooling  steps. 

•  The  rate  of  cooling,  $\alpha$.

We  implemented  a  wide  variety  of  such  perturbation 
schemes,  as  described  below. 

In selecting the number of phases to perturb we considered
choosing both a random number and a constant number of phases.
Since the amount of perturbation is to be chosen such that the
next configuration is within a neighborhood of the previous
configuration, and since we had limited knowledge of the
function, we allowed the amount of the perturbations to range
from 0 to $2\pi$ radians. That is, the amount of perturbation, r,
for a given perturbation scheme was chosen such that
$0 \le r \le 2\pi$.

We considered both a constant and a conditional number of cooling
steps. When implementing the conditional cooling scheme,
termination of the cooling loop occurred when the present
configuration (i.e., the set of phases) remained unchanged for 10
cooling steps. A variety of constant Markov chain lengths were
considered based on experimentation. We allowed $\alpha$, the
rate of cooling, to vary between 0.8 and 0.99, as the length of
the Markov chain varied.

The most promising results we obtained were with a very slow
cooling rate, $\alpha = 0.99$, combined with a small Markov chain
length, and a small perturbation amount for a random subset of
the phases.
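A phase-space perturbation of the kind described here might look like the sketch below. The text does not give the exact rule, so several details are assumptions: shifts are drawn uniformly from a symmetric interval, phases are wrapped modulo $2\pi$, and the function name and arguments are hypothetical.

```python
import math
import random

def perturb_phases(phases, max_shift, rng, subset_size=None):
    """Return a copy of `phases` with a random subset shifted.

    Each selected phase moves by a uniform amount in [-max_shift, max_shift]
    and is then reduced modulo 2*pi. If `subset_size` is None, the number of
    phases to move is itself chosen at random.
    """
    n = len(phases)
    count = subset_size if subset_size is not None else rng.randint(1, n)
    new_phases = list(phases)                  # leave the caller's list untouched
    for i in rng.sample(range(n), count):      # distinct positions to perturb
        shift = rng.uniform(-max_shift, max_shift)
        new_phases[i] = (new_phases[i] + shift) % (2.0 * math.pi)
    return new_phases

rng = random.Random(1)
old = [0.0] * 10
new = perturb_phases(old, 0.1, rng, subset_size=3)
print(sum(1 for a, b in zip(old, new) if a != b), "phases changed")
```

Passing a fixed `subset_size` corresponds to the constant-number scheme, and leaving it as `None` corresponds to the random-number scheme.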

We consider this process successful if the set of phases produced
is within 30° to 40° of the true set. Unfortunately, simulated
annealing in phase space was not producing such results. Our next
approach was to use simulated annealing in atom space. Our
rationale for this is that working in atom space allows us to
impose chemical knowledge of the structure on the problem, such
as requiring that no two atoms be closer than 1.2 Å. In addition,
the number of variables for the minimal function is reduced from
approximately 10N to N, where N is the total number of atoms in
the structure and 10N is the approximate number of phases being
considered.

In order to use atom space we must transform the atomic
coordinates to a set of phases, since the phases are needed to
calculate R. This calculation is called the structure factor
calculation and is computationally expensive. The structure
factor for each phase p is determined as follows:

$$A_p = \sum_{j \in NA} f_j \cos 2\pi (x_p x_j + y_p y_j + z_p z_j)$$

$$B_p = \sum_{j \in NA} f_j \sin 2\pi (x_p x_j + y_p y_j + z_p z_j)$$
Figure 3: Data Structures


where $x_p$, $y_p$ and $z_p$ are the p-th Miller indices, $x_j$,
$y_j$ and $z_j$ are the coordinates of atom j, and NA is the set
of atoms including the set of symmetry elements for each atom.
Thus, as can be seen in Figure 3, each Miller index, $M_i$, has
associated with it an intensity value $|E_i|$ and an (unknown)
phase, $\phi_i$. Given p active processors and P phases, the
calculation of the structure factor can be performed efficiently
in parallel by assigning each active processor the calculation of
P/p phases.

Once again a variety of perturbation schemes were attempted. We
continued the use of conditional cooling lengths and allowed the
cooling rate, $\alpha$, to vary in the range
$0.8 \le \alpha \le 0.99$. Our attention then focused on
determining a suitable Markov chain length and on a method of
perturbing the atoms.

Due to the nature of the structure factor calculation, the
movement of one atom changes the entire set of phases. Since R is
a function of the phases, we restricted the number of atoms that
were perturbed at each iteration. Two main methods were
implemented for perturbing an atom. In the first, we restricted
the perturbation to be within a cube of edge size e, where
0.5 Å ≤ e ≤ 6 Å. Chemical information was used by requiring that
the perturbed atom not be closer than 1.2 Å to any other atom.
The second method moved the atom in the same fashion but did not
utilize chemical information about the structure. The second
method generally gave lower R values but the structures produced
were not necessarily feasible.

We also used two main methods for determining the length of the
Markov chains. The first determined the length of each Markov
chain by looking at the most recent R values within the chain. If
these values did not vary by more than some $\epsilon$, then the
Markov chain was terminated [1]. The second method set the length
of each Markov chain as a function of the cooling value,
$\alpha$. This was accomplished by allowing a Markov chain to
terminate after x transitions had been accepted. Since the number
of acceptances decreases as cooling progresses, the length of the
Markov chain became an increasing function of the cooling value.
For this reason an upper bound was placed on the length of the
Markov chain [1]. The most promising results were obtained when
the chain length was a function of the cooling value. Although
these results were more promising than simulated annealing in
phase space, we still were unable to minimize R.


6  Grid  Method


It  is  conjectured  [2]  that  the  minimal  R  value  of  a 
structure  with  N  atoms  can  be  found  by  first  min¬ 
imizing  the  R  value  for  a  single  atom,  and  then  se¬ 
quentially  minimizing  the  R  value  of  the  set  consisting 
of  the  previous  atom(s)  and  one  additional  atom,  until 
all  N  atoms  have  been  placed.  Our  current  strategy 
exploits  this  conjecture. 

In order to gain insight into the behavior of the minimal
function, we performed an exhaustive search of the
three dimensional unit cell at lattice point intervals of
0.25 Å. This showed a function which varied extremely
rapidly, leading us to believe the function was much
wilder than originally suspected. We then performed
an exhaustive search of the three dimensional unit cell
at lattice point intervals of 0.1 Å. This search verified the
unpredictability of the function. For example, within
a distance of 0.3 Å the values of R change from the highest
to near the lowest. This rapid fluctuation explains
why attempts to utilize simulated annealing on this
function were unsuccessful.

As discussed previously, the data used for the above
methods consisted of choosing the top n Miller indices
by intensity and using these Miller indices to
determine the corresponding sets of triplets and quartets.
This data, which can be termed “high” resolution
data, was originally chosen since the estimates for It
and Iq are more reliable. In hopes that the minimal
function would become smoother, we are currently using
a “low” resolution data set. Our “low” resolution
data set was obtained by taking the j lowest d* values,
where j is arbitrarily chosen to be greater than n.
Given reciprocal lattice axes a*, b*, c* and Miller index
Mi = (x, y, z), d*, the length of the reciprocal vector,
is equal to a*x + b*y + c*z. These j Miller indices are
then sorted in increasing order by their |E| values. The
first n Miller indices of this sorted list are then used to
generate the appropriate triplets and quartets. Using
this “low” resolution data set we performed exhaustive
searches on the three dimensional unit cell. These
searches at 0.25 Å and 0.1 Å lattice point intervals showed
the function to be much smoother than it was with the
high resolution data.
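
The selection of the “low” resolution data set described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout (`hkl`, `d_star`, `E`) and the function name are hypothetical, d* is taken as already computed for each reflection, and the sort direction follows the text as written (increasing |E|).

```python
def select_low_resolution(reflections, j, n):
    """Pick n Miller indices from the j reflections with the lowest d*.

    reflections: list of dicts with hypothetical keys
      'hkl' (Miller index), 'd_star' (precomputed d*), 'E' (normalized value).
    """
    if not (0 < n < j <= len(reflections)):
        raise ValueError("need 0 < n < j <= number of reflections")
    # Step 1: keep the j reflections with the smallest d* (lowest resolution).
    lowest = sorted(reflections, key=lambda r: r["d_star"])[:j]
    # Step 2: sort those j reflections by |E|, in increasing order as stated.
    by_e = sorted(lowest, key=lambda r: abs(r["E"]))
    # Step 3: the first n entries seed the triplet and quartet generation.
    return [r["hkl"] for r in by_e[:n]]


if __name__ == "__main__":
    data = [
        {"hkl": (1, 0, 0), "d_star": 0.1, "E": 2.0},
        {"hkl": (0, 1, 0), "d_star": 0.2, "E": 0.5},
        {"hkl": (0, 0, 1), "d_star": 0.3, "E": 1.0},
        {"hkl": (1, 1, 0), "d_star": 0.9, "E": 0.1},
    ]
    print(select_low_resolution(data, j=3, n=2))
```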

For  the  molecular  structure  that  we  are  currently 
working  with  the  first  atom  is  believed  to  be  critical,  as 
it  selects  the  origin  for  the  structure.  Some  structures, 
in  different  space  groups,  require  more  than  one  atom 
to  be  placed  before  the  origin  is  determined.  There¬ 
fore,  we  are  confident  that  if  we  can  determine  the 
first  atom  of  the  structure,  then  we  can  determine  the 
entire  structure  using  the  grid  method  on  each  suc¬ 
cessive  atom.  Thus,  we  plan  on  spending  the  bulk  of 
our  time  minimizing  R  with  respect  to  a  single  atom. 
We  are  currently  focusing  our  attention  on  the  ques¬ 
tion  of  how  fine  a  grid  is  needed  in  order  to  obtain  the 
minimum.  We  are  also  considering  “grid  refinement” 
methods  that  consist  of  multiple  stages  of  grid  applica¬ 
tions  with  successively  smaller  grid  intervals.  After  the 
grid  method  has  been  refined  sufficiently  for  a  single 
atom  minimum,  we  plan  on  generalizing  the  method 
to  larger  subsets  of  the  structure. 
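
A minimal sketch of the “grid refinement” idea mentioned above: an exhaustive search on a coarse grid, followed by searches on successively finer grids restricted to a window around the best point found so far. The objective `f`, the window rule (one previous grid interval on each side) and the function names are assumptions made for illustration; the text does not specify them. On a rapidly varying function such as R with the high resolution data, a coarse first stage may well miss the true minimum.

```python
import itertools


def grid_search(f, lo, hi, step):
    """Evaluate f on a regular 3D grid and return (best_value, best_point)."""
    axes = []
    for a, b in zip(lo, hi):
        n = int(round((b - a) / step))
        axes.append([a + k * step for k in range(n + 1)])
    return min((f(p), p) for p in itertools.product(*axes))


def refined_search(f, cell, steps):
    """Apply grid_search with successively smaller intervals.

    cell:  edge lengths of the (assumed orthogonal) unit cell.
    steps: decreasing grid intervals, e.g. [1.0, 0.25, 0.05].
    """
    lo, hi = (0.0, 0.0, 0.0), tuple(cell)
    best = None
    for step in steps:
        best = grid_search(f, lo, hi, step)
        _, p = best
        # Next stage: one previous interval either side of the current best,
        # clipped to the unit cell.
        lo = tuple(max(0.0, x - step) for x in p)
        hi = tuple(min(c, x + step) for x, c in zip(p, cell))
    return best


if __name__ == "__main__":
    # Smooth toy objective standing in for R; its minimum is at (1.3, 0.7, 2.1).
    target = (1.3, 0.7, 2.1)
    toy = lambda p: sum((x - t) ** 2 for x, t in zip(p, target))
    print(refined_search(toy, (4.0, 4.0, 4.0), [1.0, 0.25, 0.05]))
```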

7  Intel  iPSC/2  Implementation 

On  our  32  node  Intel  iPSC/2  hypercube,  we  have  im¬ 
plemented  the  minimization  techniques  previously  de¬ 
scribed  in  serial  while  the  structure  factor  and  R  value 
calculations  are  performed  in  parallel.  The  reason  for 
this  is  that  the  calculation  of  both  the  structure  factor 
and  R  are  computationally  expensive  relative  to  the 
overhead  of  the  minimization  techniques.  Thus  the 
implementation  focuses  on  exploiting  multiple  proces¬ 
sors  to  efficiently  compute  the  structure  factor  and  R 
value.  Once  we  have  obtained  satisfactory  solutions, 
we  will  consider  parallelizing  the  minimization  tech¬ 
niques. 

The data is initially distributed so that each of
the P processors has T/P triplets and Q/P quartets,
plus consistent copies of all of the remaining
data structures. Each processor then computes the
contributions of its set of T/P triplets and Q/P quartets. These partial
sums are then summed to node 0 by recursive halving,
after which node 0 performs the final division by
a term that is constant and is computed
during the initialization of the data set. Notice
that the calculation of any partial sum of the minimal
function may require the use of the entire set of
phases, which is why each active node must contain a
consistent copy of the entire set of phases.
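
The reduction of partial sums to node 0 can be illustrated with a small serial simulation. This is a sketch, not the iPSC/2 code: the pairing rule (the upper half of the active nodes sends to the lower half at each stage) is one common way to realize recursive halving on a hypercube and is an assumption here, as are the function names.

```python
def split_evenly(items, parts):
    """Divide items into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for k in range(parts):
        end = start + size + (1 if k < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def recursive_halving_sum(partials):
    """Simulate summing one value per node down to node 0.

    At each stage node (k + half) sends its value to node k; the two node
    numbers differ in exactly one bit, so they are hypercube neighbors.
    """
    n = len(partials)
    if n == 0 or n & (n - 1):
        raise ValueError("node count must be a power of two")
    values = list(partials)
    active = n
    while active > 1:
        half = active // 2
        for node in range(half):
            values[node] += values[node + half]
        active = half
    return values[0]


if __name__ == "__main__":
    terms = list(range(1, 17))          # stand-ins for triplet/quartet terms
    partials = [sum(c) for c in split_evenly(terms, 8)]
    print(recursive_halving_sum(partials))   # log2(8) = 3 communication stages
```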

This implementation evenly distributes the work
and the data set among the processors. Therefore,
increasing the number of processors not only allows
for a faster solution, but for much larger problems to
be solved. As can be seen from Figure 4, a near perfect
linear speed-up is observed in tests ranging from
4 nodes (the minimum number of nodes required to hold all
of the data) to 32 nodes (the maximum number of nodes on
our machine).

The  calculation  of  the  structure  factor  for  n  phases, 
given  P  active  processors  and  a  atoms,  is  divided  into 
P  subsets  of  phases.  Each  processor  computes  n/P 
structure  factors.  Since  the  computation  of  any  one 
structure  factor  requires  the  use  of  all  a  atoms,  each 
processor  must  maintain  a  consistent  copy  of  all  a 
atoms.  In  addition,  all  processors  must  maintain  a 
complete  set  of  phases  for  the  calculation  of  R.  To 
achieve  this,  the  subsets  of  phases  produced  by  the 
structure  factor  calculation  are  combined  by  recursive 
halving  to  node  0.  Although  the  order  of  the  phases 
within  each  subset  will  remain  the  same  during  recur¬ 
sive  halving,  the  order  of  the  subsets  will  not.  There¬ 
fore,  node  0  (the  master)  must  order  the  subsets  of 
phases  prior  to  distributing  the  entire  set  of  phases  to 
the  slaves  (all  active  processors)  by  recursive  doubling. 
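
The reordering step can be seen in a small simulation. The sketch below assumes each node tags its block of phases with its node number and that, at each stage, the upper half of the active nodes sends to the lower half; under that assumed pairing the blocks arrive at node 0 out of node order, and a sort on the tag restores it before the broadcast. The function names are hypothetical.

```python
def gather_to_node0(subsets):
    """Simulate gathering one block per node to node 0 by recursive halving.

    Returns the (node, block) pairs in the order they end up on node 0.
    """
    n = len(subsets)
    if n == 0 or n & (n - 1):
        raise ValueError("node count must be a power of two")
    held = [[(node, list(block))] for node, block in enumerate(subsets)]
    active = n
    while active > 1:
        half = active // 2
        for node in range(half):
            # node + half sends everything it holds; the receiver appends it.
            held[node].extend(held[node + half])
        active = half
    return held[0]


def ordered_phases(tagged_blocks):
    """Restore node order on the master and flatten to one phase list."""
    phases = []
    for _, block in sorted(tagged_blocks, key=lambda t: t[0]):
        phases.extend(block)
    return phases


if __name__ == "__main__":
    blocks = [[0, 1], [2, 3], [4, 5], [6, 7]]   # phase indices held per node
    arrived = gather_to_node0(blocks)
    print([node for node, _ in arrived])        # arrival order of the blocks
    print(ordered_phases(arrived))
```

Within each block the phases keep their order, which matches the observation in the text; only the order of the blocks needs repair.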

This implementation of the structure factor calculation
has produced near linear speed-up, as can be
seen in Figure 5. However, the speed-up of the structure
factor calculation is not as efficient as that of the
R value calculation, since the structure factor calculation
has the additional overhead of re-ordering and
broadcasting the phases.

Figure 4: Time of R Calculation for a 29 Atom Structure
using 300 Phases.

Figure 5: Time of Structure Factor Calculation for 29
Atom Structure with 300 Phases

7.1  Phase  Space  Simulated  Annealing 

For  the  implementation  of  simulated  annealing  in 
phase  space  on  P  active  processors,  node  0  performs 
simulated  annealing,  with  each  active  processor  coop¬ 
erating  in  the  calculation  of  R.  This  can  be  viewed  as 
a  master/slave  implementation.  At  each  iteration  of 
the  simulated  annealing  process  a  subset  of  the  phases 
is  perturbed  in  serial,  concurrently  by  all  processors. 
To ensure that all processors maintain identical sets
of phases, a message is broadcast to all processors,
by recursive doubling, with the current random number
generator seed. Since the same random number
generator is used on all processors, we know all processors
will perturb the phases in the same manner.
The master determines, by the properties of simulated
annealing, whether to accept or reject the perturbed
set of phases. This decision is then broadcast to the
slaves by recursive doubling. To reduce the amount
of message passing, the accept/reject decision and the
current seed are sent in the same message.
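
The seed-sharing scheme can be sketched in a few lines. The message layout (an accept flag for the previous trial paired with the seed for the next one), the perturbation rule and all names below are assumptions for illustration; the point is only that identical seeds and identical generators give identical perturbations on every node without sending the phases themselves.

```python
import random


def perturb(phases, seed, count, max_shift):
    """Perturb `count` randomly chosen phases (degrees) using a seeded RNG."""
    rng = random.Random(seed)
    new = list(phases)
    for idx in rng.sample(range(len(new)), count):
        new[idx] = (new[idx] + rng.uniform(-max_shift, max_shift)) % 360.0
    return new


def node_step(current, trial, message):
    """What every node does on receiving one broadcast message.

    message = (accept, seed): accept refers to the previous trial set,
    seed drives the next perturbation.
    """
    accept, seed = message
    if accept and trial is not None:
        current = trial
    trial = perturb(current, seed, count=2, max_shift=30.0)
    return current, trial


if __name__ == "__main__":
    start = [10.0, 95.0, 180.0, 270.0, 315.0]
    messages = [(False, 11), (True, 12), (False, 13)]
    states = [(list(start), None) for _ in range(4)]     # four simulated nodes
    for msg in messages:
        states = [node_step(cur, tri, msg) for cur, tri in states]
    print(all(s == states[0] for s in states))
```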

7.2  Atom  Space  Simulated  Annealing 

A  master/slave  model  is  also  used  for  the  atom  space 
simulated  annealing  implementation.  Although  theo¬ 
retically  very  different,  the  implementations  of  phase 
space  and  atom  space  are  very  similar.  The  master 
performs  simulated  annealing,  while  the  slaves  coop¬ 
erate  in  the  calculation  of  R  and  the  structure  factor. 


At each iteration of the simulated annealing process
a subset of atoms is perturbed. A copy of the entire
set of atoms must be maintained on all processors
so that the structure factor calculation may be performed
in parallel, as previously described. Therefore,
to reduce the amount of message passing, we chose to
perform the perturbation in serial, concurrently by all
processors. We ensure that the perturbed set of atoms is
consistent throughout all processors by broadcasting
the current seed to all nodes.

7.3  Grid  Method 

The grid method is also implemented using a master/slave
model. The calculations of the structure factor
and R are the same as in the simulated annealing
implementations. Again each node contains the current
structure and set of phases. Since we are performing
a systematic exhaustive search of the three
dimensional unit cell, the current structure is easily
maintained on all nodes.

8  Additional  Architectures 

In  addition  to  the  hypercube,  we  have  implemented  all 
three  methods  on  the  Connection  Machine  CM-2  at 
NPAC.  Our  current  implementation  requires  approxi¬ 
mately  0.2  sec.  to  calculate  R  and  0.35  sec.  to  calcu¬ 
late  the  structure  factor  for  a  29  atom  structure  with 
300  phases  on  16K  processors  of  the  Connection  Ma¬ 
chine.  We  have  also  recently  begun  using  a  network 
of  12  Sun  4  workstations.  On  one  workstation  we  can 
compute,  in  serial,  R  in  approximately  4.3  seconds  for 
300  phases.  We  plan  on  using  the  network  to  divide 
the  unit  cell  into  12  cells  and  have  each  workstation 
perform an exhaustive search of its subcell.

Abstract

A new modeling technique (the Lattice Grain
Model) is presented for the simulation of two-dimensional
granular systems involving large numbers
(~10^4 to 10^6) of grains. These granular systems (e.g.,
rock slides, planetary rings, industrial powders, etc.)
may include both high shear rate regions as well as
static plugs of grains and cannot easily be handled
within the framework of existing continuum theories
such as soil mechanics.

The  Lattice  Grain  Model  (LGrM)  is  similar  to 
the  Lattice  Gas  Model  (LGM),  which  was  introduced 
as  a  discrete  model  of  fluids,  in  that  the  computation 
is  carried  out  by  means  of  cellular  automata  which 
evolve  according  to  a  simple  set  of  rules  based  on  lo¬ 
cal  interactions.  This  allows  large  simulations  to  be 
programmed  onto  a  hypercube  concurrent  processor 
in  a  straightforward  manner.  However,  it  differs  from 
LGM  in  that  it  includes  the  inelastic  collisions  and 
volume-filling  properties  of  macroscopic  grains. 

Examples  to  be  presented  will  include  Couette 
flow,  flow  through  an  hourglass,  and  gravity-driven 
flows  around  obstacles. 

Introduction 

Physical systems comprised of discrete, macroscopic
particles or grains which are not bonded to one
another occur importantly in civil, chemical, and agricultural
engineering, as well as in natural geological
and planetary environments. Granular systems are
observed in rock slides, sand dunes, clastic sediments,
snow avalanches, and planetary rings, while in engineering
and industry they are found in connection with
the processing of cereal grains, coal, gravel, oil shale,
and powders, and are well-known to pose important
problems associated with the movement of sediments
by streams, rivers, waves, and the wind.

The  standard  approach  to  the  theoretical  mod¬ 
eling  of  multiparticle  systems  in  physics  has  been  to 
treat  the  system  as  a  continuum  and  to  formulate  the 
model  in  terms  of  differential  equations.  As  an  ex¬ 
ample,  the  science  of  soil  mechanics  has  traditionally 
focussed  mainly  on  quasi-static  granular  systems,  a 
prime  objective  being  to  define  and  predict  the  con¬ 
ditions  under  which  failure  of  the  granular  soil  system 


will  occur.  Soil  mechanics  is  a  macroscopic  continuum 
model requiring an explicit constitutive law relating,
say,  stress  and  strain;  and  while  very  successful  for  the 
low-strain  quasi-static  applications  for  which  it  is  in¬ 
tended,  it  is  not  clear  how  it  can  be  generalized  to  deal 
with  the  high-strain,  explicitly  time-dependent  phe¬ 
nomena  which  characterize  a  great  many  other  granu¬ 
lar  systems  of  interest.  Attempts  at  obtaining  a  gen¬ 
eralized  theory  of  granular  systems  using  a  differential 
equation  formalism  [1]  have  met  with  limited  success. 

An  alternate  approach  to  formulating  physical 
theories  can  be  found  in  the  concept  of  cellular  au¬ 
tomata,  which  was  first  proposed  by  Von  Neumann  in 
1948.  In  this  approach,  the  space  of  a  physical  prob¬ 
lem  would  be  divided  up  into  many  small,  identical 
cells  each  of  which  would  be  in  one  of  a  finite  number 
of  states.  The  state  of  a  cell  would  evolve  according  to 
a rule which is both local (involves only the cell itself
and  nearby  cells)  and  universal  (all  cells  are  updated 
simultaneously  using  the  same  rule). 

The  Lattice  Grain  Model  [2]  (LGrM)  we  discuss 
here  is  a  microscopic,  explicitly  time-dependent,  cel¬ 
lular  automata  model,  and  can  be  applied  naturally 
to  high-strain  events.  LGrM  carries  some  attributes 
of  both  particle  dynamics  models  [3,  4]  (PDM),  which 
are  based  explicitly  on  Newton’s  second  law,  and  lat¬ 
tice  gas  models  [3]  (LGM),  in  that  its  fundamental 
element  is  a  discrete  particle,  but  differs  from  these 
substantially  in  detail.  Here  we  describe  the  essential 
features  of  LGrM,  compare  the  model  with  both  PDM 
and  LGM,  and  finally  discuss  some  applications. 

Comparison  to  Particle  Dynamics  Models 

The  purpose  of  the  lattice  grain  model  is  to  pre¬ 
dict  the  behavior  of  large  numbers  of  grains  (10,000  to 
1,000,000)  on  scales  much  larger  than  a  grain  diameter. 
In this respect, it goes beyond particle dynamics calculations,
which are limited to no more than ~10,000
grains by currently available computing resources [3,
4]. The particle dynamics models follow the motion of
each  individual  grain  exactly,  and  may  be  formulated 
in  one  of  two  ways  depending  upon  the  model  adopted 
for  particle-particle  interactions. 

In one formulation, the interparticle contact times
are assumed to be of finite duration, and each particle
may be in simultaneous contact with several others
[3]. Each particle obeys Newton’s law, F = ma, and a
detailed integration of the equations of motion of each
particle is performed. In this form, while useful for
applications involving a much smaller number of particles
than LGrM allows, PDM cannot compete with
LGrM for systems involving large numbers of grains
because of the complexity of PDM “automata”.

In  the  second,  simpler  formulation,  the  interpar¬ 
ticle  contact  times  are  assumed  to  be  of  infinitesimal 
duration,  and  particles  undergo  only  binary  collisions 
(the  hard-sphere  collisional  models)  [4].  Hard-sphere 
models  usually  rely  upon  a  collision-list  ordering  of  col¬ 
lision  events  to  avoid  the  necessity  of  checking  all  pairs 
of  particles  for  overlaps  at  each  time  step.  In  regions 
of  high  particle  number  density,  collisions  are  very  fre¬ 
quent;  and  thus  in  problems  where  such  high  density 
zones  appear,  hard-sphere  models  spend  most  of  their 
time  moving  particles  through  very  small  distances  us¬ 
ing  very  small  time  steps.  In  granular  flow,  zones  of 
stagnation  where  particles  are  very  nearly  in  contact 
much  of  the  time  are  common,  and  the  hard-sphere 
model  is  therefore  unsuitable,  at  least  in  its  simplest 
form,  as  a  model  of  these  systems.  LGrM  avoids  these 
difficulties  because  its  time-stepping  is  controlled  not 
by  a  collision  list  but  by  a  scan  frequency  which  in 
turn  is  a  function  of  the  speed  of  the  fastest  particle 
and  is  independent  of  number  density.  Furthermore, 
although  fundamentally  a  collisional  model,  LGrM  can 
also  mimic  the  behavior  of  consolidated  or  stagnated 
zones  of  granular  material  in  a  manner  which  will  be 
described  below. 

Comparison  to  Lattice  Gas  Models 

LGrM  closely  resembles  LGM  [5]  in  some  respects. 
First,  for  2D  applications,  the  region  of  space  in  which 
the  particles  are  to  move  is  discretized  into  a  triangu¬ 
lar  lattice-work,  upon  each  node  of  which  can  reside  a 
particle.  The  particles  are  capable  of  moving  to  neigh¬ 
boring  cells  at  each  tick  of  the  clock,  subject  to  cer¬ 
tain  simple  rules.  Finally,  two  particles  arriving  at  the 
same  cell  (LGM)  or  adjacent  cells  (LGrM)  at  the  same 
time  may  undergo  a  “collision”  in  which  their  outgoing 
velocities  are  determined  according  to  specified  rules 
chosen  to  conserve  momentum. 

Each  of  the  particles  in  LGM  has  the  same  mag¬ 
nitude  of  velocity  and  is  allowed  to  move  in  one  of 
six  directions  along  the  lattice,  so  that  each  particle 
travels  exactly  one  lattice  spacing  in  each  time  step. 
The  single  velocity  magnitude  means  that  all  colli¬ 
sions  between  particles  are  perfectly  elastic  and  that 
energy  conservation  is  maintained  simply  through  par¬ 
ticle  number  conservation.  It  also  means  that  the  tem¬ 
perature  of  the  gas  is  uniform  throughout  time  and 
space,  thus  limiting  the  application  of  LGM  to  prob¬ 
lems  of  low  Mach  number.  An  exclusion  principle  is 


maintained  in  which  no  two  particles  of  the  same  ve¬ 
locity  may  occupy  one  lattice  point.  Thus  each  lattice 
point  may  have  no  more  than  six  particles,  and  the 
state  of  a  lattice  point  can  be  recorded  using  only  six 
bits. 
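
The six-bit site state can be illustrated with a bitmask, one bit per lattice direction. The bit numbering (direction d stored in bit d) and the function names are assumptions for this sketch; the text only states that six bits suffice.

```python
DIRECTIONS = 6  # six velocity directions on the triangular lattice


def add_particle(state, direction):
    """Set the bit for `direction`, enforcing the LGM exclusion principle."""
    if not 0 <= direction < DIRECTIONS:
        raise ValueError("direction must be in 0..5")
    bit = 1 << direction
    if state & bit:
        raise ValueError("a particle with this velocity is already present")
    return state | bit


def count_particles(state):
    """Number of particles at a site equals the number of set bits."""
    return bin(state & 0b111111).count("1")


if __name__ == "__main__":
    site = 0
    site = add_particle(site, 0)
    site = add_particle(site, 3)      # a head-on pair along one lattice axis
    print(format(site, "06b"), count_particles(site))
```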

LGrM differs from LGM in having many possible
velocity  states,  not  just  six.  In  particular,  in  LGrM 
not  only  the  direction  but  the  magnitude  of  the  ve¬ 
locity  can  change  in  each  collision.  This  is  a  necessary 
condition  because  the  collision  of  two  macroscopic  par¬ 
ticles  is  always  inelastic,  so  that  mechanical  energy  is 
not  conserved.  The  LGrM  particles  satisfy  a  somewhat 
different  exclusion  principle:  no  more  than  one  parti¬ 
cle  at  a  time  may  occupy  a  single  site.  This  exclusion 
principle  allows  LGrM  to  capture  some  of  the  volume¬ 
filling  properties  of  granular  material,  in  particular  to 
be  able  to  approximate  the  behavior  of  static  granular 
masses. 

The determination of the time step is more critical
in LGrM than in LGM. If the time step is long
enough that some particles travel several lattice spacings
in one clock tick, there arises the problem of finding
the intersection of particle trajectories. This involves
much computation and defeats the purpose of
an  automata  approach.  A  very  short  time  step  would 
imply  that  most  particles  would  not  move  even  a  single 
lattice  spacing.  Here  we  choose  a  time  step  such  that 
the  fastest  particle  will  move  exactly  one  lattice  spac¬ 
ing.  A  “position  offset”  is  stored  for  each  of  the  slower 
particles,  which  are  moved  accordingly  when  the  offset 
exceeds  one-half  lattice  spacing.  These  extra  require¬ 
ments  for  LGrM  automata  imply  a  slower  computa¬ 
tion  speed  than  expected  in  LGM  simulations;  but, 
as  a  dividend,  we  can  compute  inelastic  grain  flows  of 
potential  engineering  and  geophysical  interest. 
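
A one-component sketch of the time-step and position-offset bookkeeping described above. It assumes the offset increment v Δt + ½ g Δt² stated with the model rules, treats a single lattice direction with unit spacing, and uses hypothetical function names; the handling on the full triangular lattice is not reproduced here.

```python
def time_step(speeds, spacing=1.0):
    """Choose dt so that the fastest particle moves exactly one lattice spacing."""
    vmax = max(abs(v) for v in speeds)
    if vmax == 0.0:
        raise ValueError("all particles are at rest; dt is undefined")
    return spacing / vmax


def advance_offset(offset, v, g, dt):
    """Accumulate the position offset over one time step."""
    return offset + v * dt + 0.5 * g * dt * dt


def hop(site, offset, spacing=1.0, target_empty=True):
    """Move to the neighboring node once the offset exceeds half a spacing."""
    if abs(offset) > 0.5 * spacing and target_empty:
        step = 1 if offset > 0 else -1
        return site + step, offset - step * spacing
    return site, offset


if __name__ == "__main__":
    speeds = [2.0, 0.5, -1.0]
    dt = time_step(speeds)                       # fastest particle sets dt
    site, offset = 10, 0.0
    for _ in range(3):                           # follow the slow particle
        offset = advance_offset(offset, 0.5, 0.0, dt)
        site, offset = hop(site, offset)
        print(site, offset)
```

In this example the slow particle stays on its node for two steps and hops on the third, carrying a negative residual offset.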

The  Rules  for  the  Lattice  Grain  Model 

In  order  to  keep  the  particle-particle  interaction 
rules  as  simple  as  possible,  all  interparticle  contacts, 
whether  enduring  contacts  or  true  collisions,  will  be 
modeled  as  collisions.  Those  collisions  which  model  en¬ 
during  contacts  will  transmit  in  each  time  step  an  im¬ 
pulse  equal  to  the  force  of  the  enduring  contact  times 
the  time  step.  The  fact  that  collisions  take  place  be¬ 
tween  particles  on  adjacent  lattice  nodes  means  that 
some  particles  may  undergo  up  to  six  collisions  in  a 
time  step.  For  simplicity,  these  collisions  will  be  re¬ 
solved  as  a  series  of  binary  collisions.  The  order  in 
which  these  collisions  are  calculated  at  each  lattice 
node,  as  well  as  the  order  in  which  the  lattice  nodes 
are  scanned,  is  now  an  important  consideration. 

The  rules  of  the  Lattice  Grain  Model  may  be  sum¬ 
marized  as  follows: 




1.  The  particles  reside  on  the  nodes  of  a  2D  trian¬ 
gular  lattice,  obeying  the  exclusion  principle  that 
no  node  may  have  more  than  one  particle. 

2.  Each  particle  has  two  components  of  velocity, 
which  may  take  on  any  value.  At  the  beginning 
of  each  time  step,  each  particle’s  velocity  is  incre¬ 
mented  due  to  the  acceleration  of  gravity. 

3.  The  size  of  each  time  step  is  set  so  that  the  fastest 
particle  will  travel  one  lattice  spacing  in  that  time 
step. 

4.  Two  components  of  a  “position  offset”  are  main¬ 
tained  for  each  particle.  This  offset  is  incremented 
after  the  velocities  in  each  time  step  according  to 
gravitational  acceleration  and  the  particle’s  veloc¬ 
ity: 

$\Delta q_i = v_i \Delta t + \tfrac{1}{2} g_i \Delta t^2$

where:
i = 1, 2,
$\Delta q_i$ = ith component of increment in position offset,
$v_i$ = ith component of particle velocity,
$g_i$ = ith component of gravitational acceleration,
$\Delta t$ = current time step.

Once  the  offset  exceeds  half  the  distance  to  the 
nearest  lattice  node,  and  that  node  is  empty,  the 
particle  is  moved  to  that  node,  and  its  offset  is 
decremented  appropriately.  Also,  in  a  collision, 
the  component  of  the  offset  along  the  line  con¬ 
necting  the  centers  of  the  colliding  particles  is  set 
to  zero. 

5.  The  order  in  which  the  lattice  is  scanned  is  cho¬ 
sen  so  as  not  to  create  a  coupling  between  the 
scan  pattern  and  the  particle  motions.  Thus  the 
particle  position  updates  are  done  on  every  third 


lattice  point  of  every  third  row,  with  this  pattern 
being  repeated  nine  times  so  as  to  cover  all  lattice 
sites. 

6. Particle collisions are calculated assuming that
the particles are smooth, hard disks with a given coefficient
of restitution. Particles on adjacent nodes
are assumed to collide if their relative velocity is
bringing them together. The following order has
been adopted for evaluating possible collisions on
odd time steps: 3b, 3c, 3f, 2f, 2c, 2b, 4b, 4c, 4f, 1f,
1c, 1b; and for even time steps: 1b, 1c, 1f, 4f, 4c,
4b, 2b, 2c, 2f, 3f, 3c, 3b (where the lattice numbers
and collision directions are defined in Figure
1).

7.  In  order  to  incorporate  a  container,  wall,  or  other 
barrier  within  these  rules,  a  second  type  of  parti¬ 
cle  is  introduced:  the  wall  particle.  This  particle 
is  similar  to  the  movable  particles,  and  interacts 
with  them  through  binary  collisions  (with  a  sepa¬ 
rately  defined  inelasticity),  but  is  regarded  as  hav¬ 
ing  infinite  mass.  To  allow  for  the  introduction  of 
shearing  motion  from  a  wall  (as  in  a  Couette  flow 
problem),  the  particles  making  up  the  wall  are 
given  a  common  constant  velocity,  which  is  used 
in  the  usual  fashion  for  calculating  the  results  of 
collisions. However, the position of the wall particles
in the lattice remains fixed throughout the
simulation. 

8. Even though a single particle does not accurately
predict the trajectory of a single grain,
we nonetheless regard each particle as representing
one grain when we are extracting information
from the simulation regarding the behavior
of groups of grains. Thus, the size of one particle,
as well as the spacing between lattice points, is
taken to be one grain diameter.
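
The binary collision of rules 6 and 7 can be sketched as below. This is the textbook smooth hard-disk update with a coefficient of restitution, written under the assumption that all movable particles have equal mass (the rules do not say otherwise) and with hypothetical function names; only the velocity component along the line of centers changes, and a wall particle is treated as having infinite mass.

```python
def collide(v1, v2, n, e, second_is_wall=False):
    """Resolve one binary collision between particles on adjacent nodes.

    v1, v2: velocities (vx, vy); n: unit vector from particle 1 to particle 2;
    e: coefficient of restitution for this kind of contact.
    Returns the two post-collision velocities.
    """
    # Closing speed along the line of centers; positive means approaching.
    u = (v1[0] - v2[0]) * n[0] + (v1[1] - v2[1]) * n[1]
    if u <= 0.0:
        return v1, v2                      # separating: no collision
    if second_is_wall:
        k = (1.0 + e) * u                  # infinite-mass partner is unchanged
        return (v1[0] - k * n[0], v1[1] - k * n[1]), v2
    k = 0.5 * (1.0 + e) * u                # equal masses share the impulse
    return ((v1[0] - k * n[0], v1[1] - k * n[1]),
            (v2[0] + k * n[0], v2[1] + k * n[1]))


if __name__ == "__main__":
    # Head-on inelastic collision, then a particle striking a moving wall.
    print(collide((1.0, 0.0), (-1.0, 0.0), (1.0, 0.0), e=0.75))
    print(collide((0.0, -1.0), (0.5, 0.0), (0.0, -1.0), e=0.75,
                  second_is_wall=True))
```

Because the disks are smooth, a wall moving parallel to a contact normal to it transmits no tangential impulse in this sketch; on the triangular lattice, shear from a moving wall would enter through the oblique contact directions.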


Figure 1: Definition of lattice numbers and collision directions

The transmission of “static” contact forces within
a mass of grains (as in grains at rest in a gravitational
field) is handled naturally within the above framework.
Even  though  a  particle  in  a  static  mass  of  grains  may 
be  nominally  at  rest,  its  velocity  may  be  nonzero  (due 
to  gravitational  or  pressure  forces);  and  it  will  transmit 
the  appropriate  force  (in  the  form  of  an  impulse)  to 
the  particles  under  it  by  means  of  collisions.  When 
these  impulses  are  averaged  over  several  time  steps, 
the  proper  weights  and  pressures  will  emerge. 

Implementation  on  a 
Parallel  Processor  Computer 

When  implementing  this  algorithm  on  a  com¬ 
puter,  what  is  stored  in  the  computer’s  memory  is 
information  concerning  each  point  in  the  lattice,  re¬ 
gardless  of  whether  or  not  there  is  a  particle  at  that 
lattice  point.  This  allows  for  very  efficient  checking  of 
the  space  around  each  particle  for  the  presence  of  other 
particles  (i.e.,  information  concerning  the  six  adjacent 
points  in  a  triangular  lattice  will  be  found  at  certain 
known  locations  in  memory).  The  need  to  keep  infor¬ 
mation  on  empty  lattice  points  in  memory  does  not 
entail  as  great  a  penalty  as  might  be  thought;  many 
lattice  grain  model  problems  involve  a  high  density  of 
particles,  typically  one  for  every  one  to  four  lattice 
points,  and  the  memory  cost  per  lattice  point  is  not 
large. The memory requirements for the implementation
of LGrM as described here are 5 variables per
lattice site: two components of position, two components
of velocity, and one status variable which denotes
an empty site, an occupied site, or a bounding “wall”
particle. If each variable is stored using 4 bytes of
memory, then each lattice point requires 20 bytes.
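
As a worked example of the storage cost and of finding the six adjacent points at known locations, the sketch below assumes a row-major array in which odd rows are shifted half a spacing to the right, with periodic wrap in the column direction. That layout, the example lattice size and the function names are assumptions for illustration, not the layout used in the original program.

```python
VARS_PER_SITE = 5     # two position, two velocity, one status variable
BYTES_PER_VAR = 4


def lattice_bytes(rows, cols):
    """Memory needed to store every lattice site, occupied or not."""
    return rows * cols * VARS_PER_SITE * BYTES_PER_VAR


def site_index(row, col, cols):
    """Offset of a site's first variable in a flat row-major array."""
    return (row * cols + col) * VARS_PER_SITE


def neighbors(row, col, rows, cols):
    """Six adjacent sites on a triangular lattice (fewer at top/bottom edges)."""
    shift = 0 if row % 2 == 0 else 1      # odd rows sit half a spacing right
    candidates = [
        (row, col - 1), (row, col + 1),
        (row - 1, col - 1 + shift), (row - 1, col + shift),
        (row + 1, col - 1 + shift), (row + 1, col + shift),
    ]
    # Periodic in columns; rows outside the lattice are dropped.
    return [(r, c % cols) for r, c in candidates if 0 <= r < rows]


if __name__ == "__main__":
    print(lattice_bytes(60, 192))                 # a 60 x 192 lattice
    print((512 * 1024) // (VARS_PER_SITE * BYTES_PER_VAR))  # sites per 512 KB
    print(neighbors(2, 5, 60, 192))
```

The second figure is only an upper bound on the sites per node, since it ignores the memory taken by the program and the operating system.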

The  standard  configuration  for  a  simulation  con¬ 
sists  of  a  lattice  with  a  specified  number  of  rows  and 
columns,  bounded  at  the  top  and  bottom  by  two  rows 
of  wall  particles  (thus  forming  the  top  and  bottom 
walls  of  the  problem  space),  and  with  left  and  right 
edges  connected  together  to  form  periodic  boundary 
conditions.  Thus  the  boundaries  of  the  lattice  are 
handled  naturally  within  the  normal  position  updat¬ 
ing  and  collision  rules,  with  very  little  additional  pro¬ 
gramming.  (Note:  since  the  gravitational  acceleration 
can  point  in  an  arbitrary  direction,  the  top  and  bot¬ 
tom  walls  can  become  side  walls  for  chute  flow.  Also, 
the  periodic  boundary  conditions  can  be  broken  by  the 
placement  of  an  additional  wall,  if  so  desired.) 

Because  of  the  nearest-neighbor  type  interactions 
involved  in  the  model,  the  computational  scheme 
was  well  suited  to  an  NCUBE  parallel  processor. 
This  machine  consists  of  512  processors,  each  with 
512  kilobytes  of  memory,  connected  together  as  a  9- 
dimensional  hypercube,  along  with  a  host  computer. 


For  the  purpose  of  dividing  up  the  problem,  the  hyper¬ 
cube  architecture  is  unfolded  into  a  two-dimensional 
array,  and  each  processor  is  given  a  roughly  equal-area 
section  of  the  lattice.  The  only  interaction  between 
sections  will  be  along  their  common  boundaries,  thus 
each  processor  will  only  need  to  exchange  information 
with  its  eight  immediate  neighbors.  The  program  itself 
was written in C under the Cubix/CrOS III operating
system.  With  Cubix,  only  a  program  for  the  nodes  of 
the  hypercube  needs  be  written;  no  separate  program 
for  the  host  computer  is  required. 
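
One standard way to unfold a hypercube into a two-dimensional processor array is a binary-reflected Gray code on each axis, sketched below. The text does not say which mapping was used, so the Gray code, the 16 by 32 split of the 512 nodes and the function names are all assumptions. With this mapping, array neighbors along a row or column (including the periodic wrap) are directly connected in the hypercube, while the four diagonal neighbors are two links away.

```python
def gray(i):
    """Binary-reflected Gray code of i."""
    return i ^ (i >> 1)


def node_id(row, col, col_bits):
    """Hypercube node number for array position (row, col)."""
    return (gray(row) << col_bits) | gray(col)


def hamming(a, b):
    """Number of hypercube links separating nodes a and b."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    rows, cols, col_bits = 16, 32, 5        # 16 x 32 = 512 nodes, 4 + 5 = 9 bits
    here = node_id(3, 7, col_bits)
    east = node_id(3, 8, col_bits)
    north_east = node_id(4, 8, col_bits)
    print(hamming(here, east), hamming(here, north_east))
```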

Simulations 

The  LGrM  simulations  performed  so  far  have  in¬ 
volved  from  ~  10^  to  10®  automata.  Trial  applica¬ 
tion  runs  included  2D,  vertical,  time-dependent  flows 
in  several  geometries  —  Couette  flow,  flow  out  of  an 
hourglass-shaped hopper, and flow down vertical channels
with embedded obstacles.

The  standard  Couette  flow  configuration  consists 
of  a  fluid  confined  between  two,  flat,  parallel  plates 
of  infinite  extent,  without  any  gravitational  acceler¬ 
ations.  The  plates  move  in  opposite  directions  with 
velocities  that  are  equal  and  that  are  parallel  to  their 
surfaces,  which  results  in  the  establishment  of  a  veloc¬ 
ity  gradient  and  a  shear  stress  in  the  fluid.  For  flu¬ 
ids  which  obey  the  Navier-Stokes  equation,  an  analyt¬ 
ical  solution  is  possible  in  which  the  velocity  gradient 
and  shear  stress  are  constant  across  the  channel.  If, 
however,  we  replace  the  fluid  by  a  system  of  inelastic 
grains,  the  velocity  gradient  will  no  longer  necessarily 
be  constant  across  the  channel.  Typically,  stagnation 
zones  or  plugs  form  in  the  center  of  the  channel  with 
thin  shear-bands  near  the  walls.  Shear-band  forma¬ 
tion  in  flowing  granular  materials  has  been  analyzed 
earlier by Haff and others [6] based on kinetic theory
models. 

The  simulation  was  carried  out  with  5760  grains, 
located  in  a  channel  60  lattice  points  wide  by  192  long. 
Due  to  the  periodic  boundary  conditions  at  the  left 
and  right  ends,  the  problem  is  effectively  infinite  in 
length.  The  first  simulation  is  intended  to  reproduce 
the  standard  Couette  flow  for  a  fluid;  consequently 
the  particle-particle  collisions  were  given  a  coefficient 
of  restitution  of  1.0  (».e.,  perfectly  elastic  collisions) 
and  the  particle-wall  collisions  were  given  a  .75  coef¬ 
ficient  of  restitution.  The  inelasticity  of  the  particle- 
wall  collisions  is  needed  to  simulate  the  conduction  of 
heat  (which  is  being  generated  within  the  fluid)  from 
the  fluid  to  the  walls.  The  simulation  was  run  until 
an  equilibrium  was  established  in  the  channel  (Figure 
2a).  The  average  x-  and  y-components  of  velocity  and 
the  second  moment  of  velocity,  as  functions  of  distance 
across  the  channel  are  plotted  in  Figure  2b. 
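
Profiles of this kind can be obtained by binning grains according to their position across the channel. The following is a minimal sketch, assuming each grain is stored as a tuple `(y, vx, vy)` with `y` an integer lattice row; the function name, the data layout, and the definition of the second moment as the mean squared deviation from the row's mean velocity are illustrative assumptions, not details of the simulation code.

```python
from collections import defaultdict

def channel_profiles(grains, width):
    """Return per-row mean vx, mean vy, and second moment of velocity.

    grains: iterable of (y, vx, vy), with 0 <= y < width (hypothetical layout).
    The second moment is taken here as the mean squared deviation of the
    velocity from the row's mean velocity (an assumed definition).
    """
    rows = defaultdict(list)
    for y, vx, vy in grains:
        rows[y].append((vx, vy))

    mean_vx = [0.0] * width
    mean_vy = [0.0] * width
    second_moment = [0.0] * width
    for y, vels in rows.items():
        n = len(vels)
        mx = sum(v[0] for v in vels) / n
        my = sum(v[1] for v in vels) / n
        m2 = sum((v[0] - mx) ** 2 + (v[1] - my) ** 2 for v in vels) / n
        mean_vx[y], mean_vy[y], second_moment[y] = mx, my, m2
    return mean_vx, mean_vy, second_moment
```

Rows that contain no grains are left at zero in all three profiles.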
Figure 2a: Elastic particle Couette flow.


Figure 2b: X-component (1), y-component (2),
and second moment (3) of velocity.

The second simulation used a coefficient of restitution
of .75 for both the particle-particle and particle-wall
collisions. The equilibrium results are shown in
Figures 3a and 3b. As can be seen from the plots,
the flow consists of a central region of particles compacted
into a plug, with each particle having almost
no velocity. Near each of the moving walls, a region of
much lower density has formed in which most of the
shearing motion occurs. Note the increase in value of
the second moment of velocity (the granular “thermal
velocity”) near the walls, indicating that grains in this
area are being “heated” by the high rate of shear. It
is interesting to note that these flows are turbulent in
the sense that shear stress is a quadratic, not a linear,
function of shear rate.
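
The role of the coefficient of restitution can be illustrated with a head-on collision between two equal-mass grains. This is a generic mechanics sketch, not the lattice collision rule used by the automaton; the function names are hypothetical and unit masses are assumed.

```python
def collide_equal_mass(v1, v2, e):
    """Post-collision velocities for a head-on collision of equal masses.

    Momentum is conserved and the relative velocity is reversed and
    scaled by the coefficient of restitution e (1.0 = perfectly elastic).
    """
    v_cm = 0.5 * (v1 + v2)          # center-of-mass velocity (unchanged)
    rel = v1 - v2                   # relative velocity before impact
    return v_cm - 0.5 * e * rel, v_cm + 0.5 * e * rel

def kinetic_energy(v1, v2):
    """Total kinetic energy of the pair, assuming unit masses."""
    return 0.5 * (v1 * v1 + v2 * v2)
```

In the center-of-mass frame each such collision removes a fraction 1 - e^2 of the kinetic energy, so with e = .75 roughly 44% is dissipated per collision, while e = 1.0 dissipates none.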


Figure  3a:  Inelastic  particle  Couette  flow. 


Figure  3b:  X-component  (1),  y-component  (2), 
and  second  moment  (3)  of  velocity. 
In the second problem, the flow of grains through
a hopper or an hourglass, with an opening only a few
grain  diameters  wide,  was  studied;  the  driving  force 
was  gravity.  This  is  an  example  of  a  granular  system 
which  contains  a  wide  range  of  densities,  from  groups 
of  grains  in  static  contact  with  one  another  to  groups 
of  highly  agitated  grains  undergoing  true  binary  col¬ 
lisions.  Here,  the  number  of  particles  used  was  8310; 
and  the  lattice  was  240  points  long  by  122  wide.  Addi¬ 
tional  walls  were  added  to  form  the  sloped  sides  of  the 
bin  and  to  close  off  the  bottom  of  the  lattice  so  as  to 
prevent  the  periodic  boundary  conditions  from  reintro¬ 
ducing  the  falling  particles  back  into  the  bin  (Figure 
4a). This is a typical feature of automata modeling:
it is often easier to configure the simulation to resemble
a real experiment (in this case by explicitly
“catching” spent grains) than to reprogram the
basic code to erase such particles.

The hourglass flow, Figure 4b, showed internal
shear  zones,  regions  of  stagnation,  free-surface  evolu¬ 
tion  toward  an  angle  of  repose,  and  an  exit  flow  rate 
approximately  independent  of  pressure  head,  as  ob¬ 
served  experimentally  [7].  It  is  hard  to  imagine  that 
one  could  solve  a  partial  differential  equation  describ¬ 
ing  such  a  complex,  multiple-domain,  time-dependent 
problem,  even  if  the  right  equation  were  known  (which 
is  not  the  case). 
Figure 4a: Initial condition of hourglass.

Figure 4b: Hourglass flow after 2048 time steps.
Another class of problems studied involves the flow
of grains around obstacles of different shapes. These
flows were observed experimentally by Nedderman,
Davies, and Horton [8] using a channel width of 20
cm and mustard seeds of .228 cm diameter confined
between two glass plates spaced 2.3 cm apart, giving
a nearly two-dimensional system. The simulation
contained 16,384 particles in a lattice of 288 points by
130 points. The diameter of the circular obstacle and
the side of the square obstacle were each one-half the
width of the channel. Two simulations, Figures 5a and
5b, showed features qualitatively similar to those observed
in the laboratory studies [8], including stagnation
zones upstream of an obstacle, and void formation
downstream.

Conclusion

These  exploratory  numerical  experiments  show 
that  an  automata  approach  to  granular  dynamics 
problems  can  be  implemented  on  parallel  computing 
machines.  Further  work  remains  to  be  done  to  assess 
more quantitatively how well such calculations reflect
the  real  world,  but  the  prospects  are  intriguing. 
Figure 5a: Flow around a circular obstacle.


Figure  5b:  Flow  around  a  square  obstacle. 
Introduction

The  uncovering  of  the  earth’s  interior  encompasses 
two  important  procedures.  The  first  step  is  to  make 
a  prediction  of  the  earth’s  interior  based  on  scien¬ 
tific  data  collected  from  geophones  and  a  knowledge 
of  the  geology  of  the  area  in  which  the  data  is  being 
collected.  The  second  step  is  to  improve  this  predic¬ 
tion  by  numerical  simulation  of  the  wave  equations. 
The  former  step  is  referred  to  as  the  inverse  problem 
and  the  latter  as  the  forward  problem.  Both  problems 
are  difficult  and  interconnected.  Once  there  is  scien¬ 
tific  confidence  that  the  structure  of  the  interior  of 
the  earth  is  adequately  known,  this  structure  is  en¬ 
coded  and  analyzed  under  various  stresses,  strains, 
and  pressures.  Using  wave  equations  to  simulate 
these  types  of  problems  involves  an  immense  number 
of  calculations,  overburdening  the  largest  available 
vector  and  parallel  computers.  In  this  paper,  we  dis¬ 
cuss  these  aspects  and  indicate  how  the  distributed 
memory  computations  can  be  used  to  solve  them  in 
an  efficient  manner. 

The  inverse  problem  in  which  we  are  interested  is 
to  determine  not  only  the  depth,  but  also  the  shape, 
of  complex  structures  below  the  surface  of  the  earth. 
That  is,  given  a  known  disturbance  of  the  earth  and 
the  record  of  the  geophones  caused  by  this  distur¬ 
bance,  we  wish  to  accurately  describe  the  structure. 
The  forward  problem  then  takes  this  structure  and 
the  disturbances  as  its  data  and  produces  the  wave 
field  over  time,  testing  for  comparison  with  surface 
waves  and  geophone  readings.  In  an  iterative  manner, 
the  inverse  solver  utilizes  these  results  to  improve  the 
initial  guess  of  the  underlying  structure  of  the  earth. 
This  process  is  repeated  until  convergence  to  the  true 
structure  is  obtained  to  within  a  specified  degree  of 
accuracy.  Once  this  accurate  structure  is  obtained, 
the  forward  solver  is  used  to  test  how  this  section 
of  the  earth  reacts  to  various  pressures,  stresses,  and 
strains. 


Many  of  the  inherent  problems  encountered  in  nu¬ 
merically  approximating  the  wave  equations  to  sim¬ 
ulate  the  propagation  of  sound  waves  in  the  earth’s 
interior  have  been  well  documented  and  analyzed  in 
the  literature.  In  this  paper,  we  only  discuss  a  few 
of  the  major  difficulties.  One  of  the  most  commonly 
mentioned  problems  is  related  to  the  vastness  of  the 
earth’s  interior.  The  numerical  solution  of  the  wave 
equation  requires  the  placing  of  grid  blocks  over  a 
finite  region  and  therefore  requires  boundary  condi¬ 
tions  for  the  computational  domain.  However,  the 
earth’s  interior  (for  the  problem  of  interest)  has  no 
subsurface  boundaries.  Therefore,  the  boundary  con¬ 
ditions  imposed  must  model  a  ‘void’  boundary  result¬ 
ing  in  what  are  known  in  the  literature  as  absorb¬ 
ing  or  radiating  boundary  conditions.  We  incorporate 
absorbing  boundary  conditions  and  load-balancing 
in  the  distributed  memory  setting  of  computation. 
Since  memory  is  limited,  the  sizes  for  the  grid  blocks 
are  bounded  from  below.  The  hyperbolic  nature  of 
the  model  equations  requires  that  a  correspondingly 
small  time  step  be  used  to  avoid  dispersion  and  nu¬ 
merical  instabilities.  The  limit  on  the  grid  size  also 
restricts  how  accurately  the  interfaces  of  the  complex 
structures  can  be  represented.  In  this  paper,  we  con¬ 
sider  all  these  problems  in  the  forward  model  using 
finite  differences  and  distributed  memory  computing, 
and  strongly  argue  that  the  improper  treatment  of 
any  of  the  above-mentioned  topics  can  lead  to  gross 
misinterpretations  in  the  inverse  problem. 

The  equations  that  we  model  for  the  pressure  dis¬ 
tribution  are  the  acoustic  wave  equations 

\[
\begin{aligned}
P_t + c^2\, \nabla \cdot (\rho v) &= 0 \\
v_t + \frac{1}{\rho}\, \nabla P &= F(x,t),
\end{aligned}
\]

where $x = (x_1, x_2, x_3)$, $P$ is the pressure distribution,
$v = (v_1, v_2, v_3)$ is the velocity, $\rho$ is the density, and
$c$ is the speed of sound in the medium. However, we
actually solve the equivalent potential equation

\[
U_{tt} + A(x)\, U_t - c^2\, \nabla \cdot (\nabla U) = g(x,t),
\]

where $g = c^2\, \nabla \cdot (\rho G)$ and $G_t = F$, noting that
$P = -U_t$ (cf. Sochacki et al., 1990). The parameters
$\rho$ and $c$ are determined by the structure of the
earth’s interior and are thus obtained from the inverse
problem.
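
The equivalence of the two formulations can be checked in a few lines. The sketch below assumes the first-order system has the form $P_t + c^2 \nabla \cdot (\rho v) = 0$, $v_t + \rho^{-1} \nabla P = F$, that $\rho$ does not depend on time, and that $\rho v = \rho G + \nabla U$ holds initially; the absorbing term $A(x) U_t$ is added afterwards and vanishes in the interior.

```latex
\begin{aligned}
P = -U_t,\quad G_t = F
  &\;\Rightarrow\; v_t = G_t + \tfrac{1}{\rho}\,\nabla U_t
   \;\Rightarrow\; \rho v = \rho G + \nabla U, \\
P_t + c^2\,\nabla\cdot(\rho v) = 0
  &\;\Rightarrow\; -U_{tt} + c^2\,\nabla\cdot(\rho G) + c^2\,\nabla\cdot(\nabla U) = 0 \\
  &\;\Rightarrow\; U_{tt} - c^2\,\nabla\cdot(\nabla U) = c^2\,\nabla\cdot(\rho G) = g .
\end{aligned}
```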

For  measuring  the  displacement  at  the  earth’s  sur¬ 
face,  we  use  the  elastic  wave  equations 

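\[
\begin{aligned}
\rho U_{tt} + A(x)\, U_t - \nabla \cdot S_1 &= F_1(x,t) \\
\rho V_{tt} + A(x)\, V_t - \nabla \cdot S_2 &= F_2(x,t) \\
\rho W_{tt} + A(x)\, W_t - \nabla \cdot S_3 &= F_3(x,t),
\end{aligned}
\]

where $S_i = (S_{i1}, S_{i2}, S_{i3})$ and $S_{ij} = \lambda\, (\nabla \cdot u)\, \delta_{ij} + 2\mu\, e_{ij}$,
and $\delta_{ij}$ is the Kronecker delta. Moreover,
$e_{ij} = \tfrac{1}{2} (\partial u_i / \partial x_j + \partial u_j / \partial x_i)$
and $u_1 = U$, $u_2 = V$, $u_3 = W$;
$c = \sqrt{(\lambda + 2\mu)/\rho}$ is the p-wave velocity and $\sigma = \sqrt{\mu/\rho}$
is the s-wave velocity (cf. Ewing, Jardetzky, and
Press (1957)). The parameters $c$, $\sigma$, and $\rho$ are also
determined by the earth’s structure.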

The  term  A(x)  is  used  for  the  absorbing  bound¬ 
ary  conditions  rather  than  dissipation.  This  term  is 
equal  to  zero  in  the  interior,  since  we  are  modeling 
nondissipative  waves,  and  is  assigned  values  on  the 
boundary  of  the  model  so  that  waves  are  sufficiently 
decayed to reduce the amplitude of the spurious reflected
waves off of the boundary (cf. Sochacki et al.,
1987). The source terms are localized disturbances
occurring either in the interior or at the surface.

Although  the  problems  discussed  above  occur 
in  both  two-  and  three-dimensional  wave  propaga¬ 
tion,  we  address  two  dimensions  in  this  paper  be¬ 
cause  of  the  simplicity  of  visualization.  Also,  since 
the  problems  discussed  above  are  similar  in  acous¬ 
tic  and  elastic  wave  propagation,  we  only  consider 
two-dimensional  acoustic  wave  equations.  How¬ 
ever,  we  also  discuss  the  problems  specific  to  three- 
dimensional  wave  propagation  and  elastic  waves  as 
they  arise.  Also,  a  single  complicated  interface  can  il¬ 
lustrate  all  the  problems  that  occur  with  a  region  con¬ 
taining  many  complicated  interfaces;  thus  the  model 
we  analyze  deals  with  a  single  interface.  The  tech¬ 
nique  used  to  handle  the  numerical  calculations  re¬ 
quired  at  an  interface  is  taken  from  Sochacki  et  al. 
(1990).  The  distributed  memory  computation  allows 
two  different  programming  strategies  to  be  consid¬ 
ered  when  solving  the  finite  difference  equations.  We 
discuss  the  pros  and  cons  of  these  two  strategies  and 
present  timings  for  each. 
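
For orientation, one explicit time step of the damped potential equation can be sketched as follows. This is a generic second-order scheme (centered differences in time, five-point Laplacian in space) on a uniform grid; it is not the interface-aware scheme of Sochacki et al. (1990), and the function name, argument layout, and fixed-boundary treatment are assumptions made for illustration.

```python
def step(u_prev, u_curr, c, damping, g, h, dt):
    """Advance U one time step with a second-order explicit scheme.

    Discretizes U_tt + A(x) U_t - c^2 Lap(U) = g with centered differences
    in time and a five-point Laplacian in space. All arguments except h
    and dt are 2D lists of equal shape; boundary values are held fixed.
    """
    ny, nx = len(u_curr), len(u_curr[0])
    u_next = [row[:] for row in u_curr]  # boundary rows/columns copied as-is
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            lap = (u_curr[j][i - 1] + u_curr[j][i + 1]
                   + u_curr[j - 1][i] + u_curr[j + 1][i]
                   - 4.0 * u_curr[j][i]) / (h * h)
            a = 0.5 * damping[j][i] * dt
            rhs = (2.0 * u_curr[j][i] - (1.0 - a) * u_prev[j][i]
                   + dt * dt * (c[j][i] ** 2 * lap + g[j][i]))
            u_next[j][i] = rhs / (1.0 + a)
    return u_next
```

Every interior grid point performs the same arithmetic here, which is the property that makes a uniform block decomposition across node processors naturally load balanced.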

Of  course,  all  the  considerations  discussed  above 
must  be  displayed  visually  in  order  to  allow  proper 
analysis and interpretation. The graphics can be
done on the parallel processing machine or the data
sets for the graphics can be transferred to a graphics
workstation that has greater visualization capabilities.

The  graphics  produced  by  the  NCUBE/ten  are  a 
result  of  information  on  the  nodes  being  dumped  to 
an  8-bit  or  24-bit  graphics  board.  The  viewpoint  of 
these  snapshots  would  be  fixed.  Hence,  one  can  only 
analyze  a  given  data  set  until  this  data  is  replaced  by 
an  updated  set.  The  interactive  capabilities  would 
be achieved only through a sequence of runs. On
the  other  hand,  if  data  sets  are  passed  through  the 
Sun4  to  a  high  performance  graphics  workstation  via 
efficient  data  compression  techniques,  the  process  of 
interactive  analysis  is  enhanced. 


The  Interface  Problems 

The  model  considered  is  1600  meters  by  1600  me¬ 
ters  and  contains  a  complicated  interface  at  an  aver¬ 
age  depth  of  800  meters  (see  Figure  1).  The  interface 
is  actually  at  a  depth  of  800  meters  at  each  horizontal 
8  meter  interval.  Between  these  points,  the  interface 
has  a  complicated  and  random  shape.  The  purpose 
of  this  choice  of  interface  configuration  is  to  show 
that  extremely  different  surface  seismograms  are  gen¬ 
erated  by  using  different  grid  sizes.  The  simulations 
(Table 1) are run for constant size square grids varying
from h = 8 meters down to 2 meters. The p-wave
velocity above the interface is 2000 m/s and the density
is 3200 kg/m³. Below the interface, the p-wave
velocity is 6000 m/s with a density of 2600 kg/m³.
In  all  cases,  the  source  is  located  at  a  depth  of  400 
meters  and  a  horizontal  distance  of  800  meters.  The 
source  is  the  derivative  of  the  Gaussian  and  has  the 
form 

\[
f(t) = A\, (t - t_0)\, e^{-\sigma (t - t_0)^2}.
\]

For stability, the time step $\Delta t$ must be chosen to
satisfy the CFL condition: $\Delta t < h / (\sqrt{2}\, c)$, where $c$ is
the maximum p-wave velocity. To minimize dispersion,
the source should act for $t = 2 t_0$ seconds and
$h$ should be no larger than a bound that depends on $\alpha$,
the minimum p-wave velocity, and $\sigma$ should be no smaller
than $20 \ln(10) / t^2$. In Table 1, we give the parameters
used  in  the  model  runs  to  show  the  discrepancies  of 
surface  seismograms.  The  time  shown  is  the  number 
of  time  steps  it  takes  for  the  wave  to  reflect  off  the 
interface  and  create  a  reasonable  surface  seismogram; 
this  is  approximately  .5  seconds. 
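
The parameter choices above can be sketched numerically. The helper names below are hypothetical, and both the factor of sqrt(2) in the two-dimensional CFL bound and the exponential form of the derivative-of-Gaussian source are assumptions about formulas that are not fully legible in the text.

```python
import math

def cfl_time_step(h, c_max):
    """Largest stable time step for the 2D scheme: dt < h / (sqrt(2) * c_max)."""
    return h / (math.sqrt(2.0) * c_max)

def min_spread(duration):
    """Lower bound on the source spread: sigma >= 20 ln(10) / t^2."""
    return 20.0 * math.log(10.0) / duration ** 2

def source(t, amplitude, t0, sigma):
    """Derivative-of-Gaussian source f(t) = A (t - t0) exp(-sigma (t - t0)^2)."""
    return amplitude * (t - t0) * math.exp(-sigma * (t - t0) ** 2)
```

For h = 4 meters and a maximum velocity of 6000 m/s the bound is about .00047 seconds, and a source duration of .126 seconds gives a minimum spread of about 2900.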
Of course, the memory needed and the number
of calculations done are directly proportional to the
number of grid points and time steps. For h < 4 it is clearly
seen  that  a  supercomputer  is  needed  to  carry  out  the 
calculations.  Therefore,  the  parallel  computer  is  an 
excellent  machine  to  display  the  importance  of  repre¬ 
senting  an  interface  accurately  for  the  inverse  prob¬ 
lem. 

In  the  field,  seismologists  use  source  frequencies 
from 2 Hz to 100 Hz. In these models, a derivative
of a Gaussian source provides a frequency equal to
2/t. As illustrated in Table 1, a higher frequency is
possible as smaller values of h are allowed. A parallel
distributed memory architecture provides the memory
and computational power needed to decrease h
to realistic physical values. In addition, with the decrease
in h, σ is used to maintain continuity in time
and not for dispersion.

All  the  above  models  use  a  single  interior  source 
which  requires  that  the  node  whose  grid  points  con¬ 
tain  the  source  information  does  substantially  more 
computations  on  startup  than  the  other  nodes,  re¬ 
sulting  in  load  imbalance.  This  increased  load  is  in¬ 
significant,  however,  compared  with  the  total  number 
of  computations  that  must  be  done  for  large  prob¬ 
lems,  i.e.  small  h.  Also,  to  highlight  the  differences 
in  the  seismograms  and  snapshots,  the  data  is  com¬ 
pressed  to  be  outputted  every  8  meters  so  that  the 
sizes  remain  the  same  in  all  three  runs.  Currently,  the 
damping (or ABC) term A(x) is nonzero only for the
30 outer grid points of the bottom and edges. Hence,
in the interior of the model, there is a memory drain
in our algorithms. One could play off memory versus
performance  on  this  aspect,  but  we  do  not  address 
this  in  the  current  investigation. 
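
A damping array of the kind described can be sketched as follows. The 30-point width comes from the text; the quadratic ramp, the peak value `a_max`, and the function name are illustrative assumptions, since the actual values assigned to A(x) are not given here.

```python
def damping_profile(nx, ny, width=30, a_max=1.0):
    """Build A(x) on an ny-by-nx grid (row 0 is the free surface).

    A is zero in the interior and ramps up quadratically over the outer
    `width` grid points of the left, right, and bottom edges. The
    quadratic ramp and a_max are illustrative choices.
    """
    profile = [[0.0] * nx for _ in range(ny)]
    for j in range(ny):
        for i in range(nx):
            # distance (in grid points) to the nearest absorbing edge
            dist = min(i, nx - 1 - i, ny - 1 - j)
            if dist < width:
                profile[j][i] = a_max * ((width - dist) / width) ** 2
    return profile
```

Storing the full array keeps the update identical at every grid point, at the cost of holding zeros for the whole interior; this is the memory-versus-performance trade-off mentioned above.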

In  both  the  seismograms  and  the  snapshot  for  the 
entire  wavefield  it  is  easily  seen  that  completely  differ¬ 
ent  information  is  given  by  using  the  finer  grid  sizes. 
It is also worth noting that the reflections off the random
interfaces show up clearly in the seismograms
(Figure 2) and on the interface in the snapshots (Figure 1).
Initially, however, it appears that the seismograms
are similar. Therefore, the importance of
having  accurate  forward  solvers  to  test  for  interfaces 
in  the  inverse  problem  is  highlighted  in  the  seismo¬ 
grams.  The  memory  capabilities  and  computational 
speed  of  the  NCUBE/ten  were  necessary  for  carrying 
out  this  numerical  experiment. 

Computing  Strategies 

There  are  two  basic  methods  for  locating  the  inter¬ 
faces  using  the  finite  difference  scheme  presented  in 
Sochacki  et  al.  (1990).  The  strategies  depend  on  how 
one  describes  the  p-wave  velocity  and  density  at  each 


grid  point.  The  two  different  strategies  arise  when  at¬ 
tempting  to  pass  these  parameters  to  the  equations 
being  calculated  in  the  most  efficient  manner  for  the 
distributed  memory.  One  method  (Method  A)  is  to 
create  a  matrix  that  contains  an  integer  for  each  grid 
point  indicating  its  region.  Each  p-wave  velocity  and 
density  is  assigned  the  corresponding  integer.  This 
means  that  we  need  a  matrix  equal  in  size  to  the 
number  of  grid  points  and  two  vectors  equal  in  size 
to  the  number  of  structures.  This  strategy  leads  to 
a  load  balancing  problem  at  the  interfaces,  since  for 
this  scheme  the  number  of  calculations  away  from 
an  interface  is  much  smaller  than  at  the  interface. 
For  models  that  are  not  too  complicated,  this  prob¬ 
lem  can  be  alleviated  by  strategic  assignment  of  the 
nodes  to  the  interfaces.  Also,  each  grid  point  must 
be  tested  to  see  if  it  lies  on  an  interface  at  each  time 
step.  This,  however,  does  not  cause  load  balancing 
problems;  it  just  increases  the  number  of  operations. 

The  second  method  (Method  B)  is  to  create  two 
matrices  both  of  which  are  equal  in  size  to  the  number 
of  grid  points.  One  matrix  contains  the  p-wave  ve¬ 
locity  at  each  grid  point  while  the  other  contains  the 
density  at  each  grid  point.  This  strategy  eliminates 
load  balancing  problems,  because  at  each  grid  point 
the  same  calculations  are  being  performed.  How¬ 
ever,  this  scheme  increases  the  amount  of  memory 
needed  and  the  total  number  of  calculations  done  sig¬ 
nificantly.  However,  if  there  is  a  large  number  of  in¬ 
terfaces  the  number  of  calculations  is  balanced  by  the 
testing  of  each  grid  point  in  the  former  scheme. 
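
The two storage strategies can be contrasted with a short sketch. The function names are hypothetical, and the interface test shown (a grid point whose neighbor lies in a different region) is an assumed stand-in for whatever test the actual scheme applies.

```python
def method_a_lookup(region, velocity_by_region, density_by_region, j, i):
    """Method A: one integer matrix plus two short per-region vectors."""
    r = region[j][i]
    return velocity_by_region[r], density_by_region[r]

def on_interface(region, j, i):
    """Method A must test whether a point borders a different region."""
    r = region[j][i]
    ny, nx = len(region), len(region[0])
    for dj, di in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        jj, ii = j + dj, i + di
        if 0 <= jj < ny and 0 <= ii < nx and region[jj][ii] != r:
            return True
    return False

def expand_to_method_b(region, velocity_by_region, density_by_region):
    """Method B: two full matrices, so every point runs identical code."""
    velocity = [[velocity_by_region[r] for r in row] for row in region]
    density = [[density_by_region[r] for r in row] for row in region]
    return velocity, density
```

Method A stores one integer per grid point and pays for a conditional at every point and time step; Method B stores two floating-point values per grid point and needs no conditional.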

We  have  presented  timings  for  both  of  these 
schemes  applied  to  the  model  with  the  single  interface 
for  three  grid  sizes.  In  Table  2  we  see  that  Method 
B  is  much  more  efficient  for  smaller  matrices.  This  is 
due  to  the  fact  that  the  extra  number  of  calculations 
in  this  method  for  smaller  problems  does  not  over¬ 
take  the  grid  point  checking  of  Method  A  and  that 
the  memory  requirements  are  still  minimal.  However, 
we see that as the model size increases the calculation
times  for  Method  A  and  Method  B  approach  the  same 
magnitude.  In  addition,  we  present  timings  for  both 
schemes  applied  to  a  model  containing  seven  struc¬ 
tures  with  relatively  complicated  interfaces  to  test 
for  the  trade-off  in  this  situation  between  performing 
conditional and computational instructions (see Figure
3). The data for the various p-wave speeds (m/s)
and corresponding densities (kg/m³) of the structures
are provided in Table 3.

The time increment Δt is .00047 and a source
is used which has a duration of t = .126 and a
spread σ = 3143. We have presented timings for
this medium comparing both Methods A and B in
Tables 4-5. Table 4 gives the timings which include
the initial startup computations, including the source
calculations, while Table 5 gives the timings for later
time only.

These  two  tables  provide  a  test  for  the  trade-off 
between  performing  conditional  and  computational 
instructions  in  this  situation.  Here  again,  we  see 
that  the  calculation  times  are  of  the  same  magnitude. 
However,  in  all  the  cases  Method  B  is  faster  than 
Method  A,  and  this  suggests  that  a  clever  method 
for  assigning  the  nodes  to  the  interface  calculations 
is  appropriate. 

We  also  note  that  in  three  dimensions,  the  sizes  of 
the  matrices  for  these  strategies  are  increased  in  size 
by  a  factor  of  the  number  of  grid  points  in  the  extra 
dimension;  additionally,  for  elastic  wave  simulation,  a 
matrix  for  the  s-wave  velocity  is  needed  in  the  latter 
strategy. 

Conclusions 

The  locating  of  interior  structures  in  the  earth’s  in¬ 
terior  is  one  of  the  important  challenges  of  geophysics. 
One  method  of  attacking  this  problem  is  using  the 
acoustic  and  elastic  wave  equations  in  two  and  three 
dimensions  on  distributed  memory  machines.  In  this 
paper  we  have  presented  two  methods  for  accom¬ 
plishing  this  and  have  presented  data  from  these  two 
methods  performed  on  an  NCUBE/ten.  There  are 
many  more  tests  that  can  be  run  on  the  two  strate¬ 
gies  presented  here,  and  these  need  to  be  carried  out; 
however, the groundwork has been laid.

The data we have presented are for 2D acoustic
wave analysis, but the ideas can be carried to 2D elastic
wave analysis and 3D in a similar manner. The
main  problem  in  3D  is  that  the  structures  become 
more  complicated  and  the  number  of  calculations  is 
greatly  increased.  However,  the  work  done  here  is 
currently  being  extended  to  3D,  and  the  major  dif¬ 
ferences  are  in  visualization  and  the  fact  that  all  the 
nodes  of  the  NCUBE/ten  must  be  used. 
Table 1.  Media Parameters

h    number of       t       σ         frequency    time steps
     grid points
8    200 × 200       -       3,131     7.95 Hz      725
4    400 × 400       -       12,415    15.87 Hz     1700
2    800 × 800       .063    -         31.74 Hz     -

Table 2.  Single Processor Timings

Table 3.  Subregion parameters

1000

2000

1000

2800

5800
Abstract

An  implementation  is  presented  for  JAC3D  on  a 
massively  parallel  hypercube  computer.  JAC3D,  a 
three  dimensional  finite  element  code  developed  at 
Sandia,  uses  several  hundred  hours  of  Cray  time 
each  year  in  solving  structural  analysis  problems. 
Two major areas of investigation are discussed: (1)
the  development  of  general  methods,  data  struc¬ 
tures,  and  routines  to  communicate  information  be¬ 
tween  processors,  and  (2)  the  implementation  and 
evaluation  of  four  algorithms  to  map  problems  onto 
the  node  processors  of  the  hypercube  in  a  load- 
balanced  fashion.  The  performance  of  JAC3D  on 
the  NCUBE/ten  is  compared  with  that  on  a  Cray 
X-MP;  the  NCUBE/ten  version  presently  takes  20% 
more  compute  time  than  the  Cray.  On  a  larger  sim¬ 
ulation  which  used  more  of  the  NCUBE’s  memory, 
the  NCUBE/ten  would  take  less  compute  time  than 
the  Cray.  Current  activity  on  the  newer  NCUBE 
2  hypercube  is  summarized  which  should  lead  to  an 
order  of  magnitude  improvement  in  run-time  perfor¬ 
mance  for  the  massively  parallel  solution  of  struc¬ 
tural  analysis  problems. 

Introduction 

In  this  paper  we  discuss  the  implementation  of 
JAC3D,  a  three  dimensional  finite  element  code 
which  uses  a  nonlinear  Jacobi  preconditioned  con¬ 
jugate  gradient  method  to  solve  large  displacement, 
large  strain,  temperature  dependent,  and  nonlinear 
material  structural  analysis  problems,  on  a  mas¬ 
sively  parallel  computer,  the  NCUBE/ten  hyper¬ 
cube.  This  code  was  developed  at  Sandia  National 
Laboratories  where  it  uses  several  hundred  hours  of 
Cray  time  each  year.  We  note  that  the  hypercube 
implementation  is  complete  in  that  a  user  has  the 
same  user  interface  and  simulation  options  on  the 
Cray  and  the  hypercube. 

*This work was partially supported by the Applied Mathe¬
matical  Sciences  program,  U.S.  Department  of  Energy,  Office 
of  Energy  Research,  and  was  performed  at  Sandia  National 
Laboratories  which  is  operated  for  the  U.S.  Department  of 
Energy  under  contract  number  DE-AC04-76DP00789. 


Two  major  implementation  issues  are  discussed 
below.  The  first  is  the  development  of  routines  to 
communicate  information  between  the  node  proces¬ 
sors  and  between  the  host  and  the  node  processors. 
The  reason  these  are  nontrivial  is  that  the  finite  el¬ 
ement  mesh  is  not  necessarily  regular  or  regularly 
numbered.  Routines  are  included  that  determine 
what  information  each  processor  sends  or  receives 
at  each  communication  step  and  with  which  proces¬ 
sors  it  is  communicating.  The  second  area  is  the  de¬ 
velopment  of  algorithms  to  map  a  problem  onto  the 
node  processors  of  the  hypercube  in  a  load-balanced 
fashion.  We  will  present  and  compare  several  map¬ 
ping  methods  that,  to  date,  have  been  executed  on 
a  SUN  workstation. 

Compute  times  are  within  20%  of  the  Cray  X-MP 
for  a  production  simulation  with  89,043  equations. 
The  NCUBE/ten  can  easily  handle  a  problem  four 
times  larger;  such  a  simulation  would  be  faster  on 
the  NCUBE/ten  relative  to  the  Cray.  Preliminary 
benchmarks on the NCUBE 2 indicate that the SUN
front  end  reduces  I/O  time  by  at  least  a  factor  of  ten 
and  that  NCUBE  2  processors  are  currently  a  factor 
of  four  faster  than  the  first-generation  processors. 
Therefore,  the  code  should  run  several  times  faster 
on  the  NCUBE  2  than  on  the  Cray  X-MP.  We  are 
also  working  on  parallelization  of  selected  mapping 
methods  and  on  a  system  to  display  JAC3D  results 
from  the  NCUBE  2  hypercube  on  a  Stellar  graphics 
workstation. 

Implementation  Issues 

Overview 

JAC3D  is  a  three  dimensional  finite  element  code 
which  uses  a  nonlinear  Jacobi  preconditioned  conju¬ 
gate  gradient  (PCG)  method  to  solve  large  displace¬ 
ment,  large  strain,  temperature  dependent,  and  non¬ 
linear  material  structural  analysis  problems  [2].  The 
serial version of the code reads in three data files: a
control file containing material constants and numbers
such as the maximum number of iterations,
an input file which contains the finite element description
of the problem, and a file which gives the
temperature at each node point for each load step.
JAC3D  then  creates  an  output  file  and  an  additional 
file  used  for  plotting  the  output. 

In  implementing  JAC3D  on  our  hypercube,  an 
NCUBE/ten,  we  have  added  a  third  input  file  which 
contains  the  order  of  the  hypercube  being  used  and 
a  mapping  of  the  elements  and  nodes  of  the  prob¬ 
lem  onto  the  node  processors.  The  NCUBE/ten  is  a 
1024  node  hypercube  which  has  0.5  MBytes  of  mem¬ 
ory  on  each  processor. 

It  was  necessary  on  the  NCUBE/ten  to  divide  the 
original  code  into  a  host  processor  code  and  a  node 
processor  code.  (This  code  division  can  be  avoided 
on  the  newer  NCUBE  2  hypercube.)  The  node  pro¬ 
cessor  code  corresponds  to  the  call  to  the  solver  in 
the  original  code,  while  the  host  code  handles  the 
input  and  output.  The  host  code  begins  by  reading 
the  input  files  and  doing  the  preprocessing  that  is 
necessary  on  the  data.  When  it  is  ready  to  call  the 
solver,  it  allocates  a  hypercube  of  the  desired  dimen¬ 
sion  and  starts  the  solver  on  the  node  processors.  In 
this  way,  running  the  solver  on  the  node  processors 
is  similar  to  calling  the  solver  as  a  subroutine  with 
the  passed  variables  now  being  communicated  be¬ 
tween  the  host  and  node  processors. 

PCG  and  Finite  Element  Methods 

The  iteration  matrix  is  calculated  at  each  itera¬ 
tion  as  it  is  needed,  which  avoids  using  the  mem¬ 
ory  which  would  be  required  to  store  the  entire  ma¬ 
trix.  The  matrix  is  calculated  element  by  element, 
so  some  information  about  each  of  the  elements  has 
to  be  kept.  This  is  done  by  dividing  the  elements 
among  the  processors  such  that  each  element  is  as¬ 
signed  to  one  processor.  In  this  way,  there  are  no 
duplicate  calculations. 

Each  element  has  a  list  of  nodes  which  are  associ¬ 
ated  with  it  and  allocates  storage  for  all  of  these 
nodes.  In  this  way  each  node  may  be  allocated 
space  in  more  than  one  processor  but  the  node  will 
be  assigned  to  only  one  processor.  That  processor 
is  responsible  for  maintaining  the  correct  value  of 
the  variables  associated  with  the  node  by  collecting 
partial  values  of  the  variables  associated  with  that 
node  from  other  processors  and  providing  these  cor¬ 
rect  values  to  the  other  processors  when  needed.  On 
each  processor,  the  nodes  which  are  assigned  to  it 
are  numbered  first,  followed  by  the  nodes  for  which 
the  processor  needs  values  but  which  are  assigned  to 
other  processors.  In  this  way,  each  processor  locally 
numbers  the  nodes  and  elements  that  it  has. 
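
The local numbering rule can be sketched as below. The dictionaries `element_nodes`, `element_owner`, and `node_owner` are hypothetical stand-ins for the mapping file contents, and sorting by global number within each group is an arbitrary choice made for reproducibility.

```python
def local_numbering(proc, element_nodes, element_owner, node_owner):
    """Return {global_node: local_index} for one processor.

    Nodes assigned to `proc` are numbered first, followed by nodes that
    its elements touch but that are assigned to other processors.
    """
    touched = set()
    for elem, nodes in element_nodes.items():
        if element_owner[elem] == proc:
            touched.update(nodes)
    owned = sorted(n for n, p in node_owner.items() if p == proc)
    ghosts = sorted(n for n in touched if node_owner[n] != proc)
    return {n: k for k, n in enumerate(owned + ghosts)}
```

Placing the owned nodes first means that operations restricted to a processor's own nodes can run over a contiguous index range.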

In  the  solution  algorithm,  the  unknowns  at  the 
nodes  are  updated  in  two  ways.  Some  calculations, 


such  as  the  calculation  of  the  residual  vector,  are 
done  element  by  element  [6].  In  order  for  the  pro¬ 
cessors  to  update  the  unknowns  associated  with  an 
element,  some  values  of  other  variables  at  the  asso¬ 
ciated  nodes  need  to  be  communicated  to  that  pro¬ 
cessor.  As  each  element  is  used,  the  unknowns  at 
the  nodes  associated  with  that  element  are  updated. 
Since  each  element  appears  in  only  one  processor, 
several  processors  will  generate  updates  to  shared 
variables,  which  requires  communication  of  partial 
results  so  these  updates  can  be  combined  to  form 
the  final  result. 

The second way that unknowns at a node get updated
is by the processor to which that node
is  assigned.  An  example  of  this  is  the  calculation  of 
the  new  direction  vector  from  a  linear  combination 
of  the  previous  direction  vector  and  the  residual  vec¬ 
tor. 

Initial  host-to-node  Communication 

The  host  processor  communicates  with  the  node 
processors  by  communicating  only  with  node  0.  Any 
data  that  the  host  sends  to  the  node  processors  is 
sent  to  processor  0  which  then  broadcasts  the  infor¬ 
mation  to  the  rest  of  the  processors  by  means  of  a 
fanout  algorithm  using  a  minimal  spanning  tree  of 
the  hypercube  rooted  at  node  processor  0  [4].  In 
the  fanout  algorithm,  successive  dimensions  of  the 
hypercube  are  used.  In  each  stage,  all  of  the  active 
processors  send  information  to  their  neighbor  in  that 
dimension.  As  those  processors  receive  information, 
they  become  active  and  will  send  information  in  the 
next  stage. 
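
A small simulation may make the fanout stages concrete. It assumes that dimensions are used from lowest to highest and that a neighbor in dimension d is found by flipping bit d of the processor number; the source does not state the order, and `fanout_schedule` is a hypothetical name.

```python
def fanout_schedule(dim):
    """Return, for each stage, the (sender, receiver) pairs of a
    broadcast from node 0 on a hypercube of the given dimension."""
    active = [0]          # processors that already hold the data
    stages = []
    for d in range(dim):  # assumption: dimensions used in increasing order
        pairs = [(p, p ^ (1 << d)) for p in active]
        stages.append(pairs)
        # Receivers become active and send in the next stage.
        active += [receiver for _, receiver in pairs]
    return stages
```

The number of active processors doubles at every stage, so a cube of dimension d is covered in d stages.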

The  host  processor  starts  by  sending  the  node 
processors  a  message  which  contains  startup  infor¬ 
mation  such  as  the  total  number  of  elements  and 
nodes  in  the  problem  and  the  maximum  number  of 
iterations.  The  node  processors  then  use  this  in¬ 
formation  to  set  up  some  temporary  arrays.  The 
host  processor  then  reads  in  the  problem  mapping 
of  the  elements  and  sends  that  information  to  the 
node  processors.  This  allows  the  node  processors  to 
determine  how  many  elements  they  have  and  allo¬ 
cate  space  for  some  arrays.  The  host  processor  then 
sends  the  list  of  nodes  which  are  associated  with 
each  of  the  elements  and  the  mapping  of  the  nodes 
to  the  node  processors.  The  node  processors  store 
the  portion  of  the  list  of  nodes  which  are  associated 
with  their  elements  and  then  use  that  with  the  map¬ 
ping  of  the  nodes  to  determine  the  number  of  nodes 
they  need  storage  for  and  to  set  up  communication 
with  other  nodes. 

Data Structures for Interprocessor Communication

Next, the node processors set up the communication
which they do during the calculations. Using
the  list  of  which  processor  has  each  node,  a  processor 
constructs  a  list  of  nodes  for  which  it  needs  values  of 
variables  but  which  are  assigned  to  other  processors. 
This  receive  list  of  nodes  is  then  sorted  by  processor 
and  the  processor  builds  an  index  to  this  list  con¬ 
sisting  of  the  processor  to  communicate  with,  the 
number  of  nodes  which  have  to  be  communicated, 
and  a  starting  index  into  the  list.  This  is  illustrated 
in  Figure  1.  When  the  list  is  sorted  by  processor, 
it  is  ordered  by  placing  the  processors  in  descending 
order  of  their  distance  from  the  processor  in  terms 
of  message  hops.  In  this  way,  messages  which  will 
take  the  longest  time  to  be  communicated  will  be 
sent  first.  In  our  experiments,  this  message  order 
cut  down  the  execution  time  of  the  algorithm. 
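
The receive list and its index can be sketched as below. The names are hypothetical, and the hop distance is taken to be the number of differing bits between two processor numbers, which is the usual hypercube metric but is an assumption here.

```python
def hops(a, b):
    """Message hops between processors a and b on a hypercube."""
    return bin(a ^ b).count("1")

def build_receive_list(my_rank, needed_nodes, node_owner):
    """Return (node list sorted by owner, index of (processor, count, start))."""
    remote = [n for n in needed_nodes if node_owner[n] != my_rank]
    # Farthest processors first, so the slowest messages are issued first.
    remote.sort(key=lambda n: (-hops(my_rank, node_owner[n]), node_owner[n], n))
    index = []
    for pos, n in enumerate(remote):
        p = node_owner[n]
        if index and index[-1][0] == p:
            proc, count, start = index[-1]
            index[-1] = (proc, count + 1, start)
        else:
            index.append((p, 1, pos))
    return remote, index
```

Each index entry records the processor to communicate with, how many nodes are exchanged, and where that processor's nodes begin in the sorted list.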


Figure  1.  Communication  Data  Structure 


Each processor sends, to every processor from which it
needs information, the list of nodes it needs
from that processor. Each processor uses the information
it receives to construct a send list, similar to its
receive list, which is used to send correct
values of variables. The communication routines use
this  general  data  structure  since  the  problems  to  be 
solved  are  generally  irregular  and  have  an  irregular 
numbering  of  the  nodes. 

When  the  processors  need  to  communicate  the 
value  of  a  variable,  they  use  the  send  list  to  send 
messages  to  other  processors  and  the  receive  list  to 
receive  messages  from  other  processors.  When  the 
values  of  an  array  need  to  be  communicated,  each 
processor  sends  a  message  to  each  of  the  processors 
in  its  send  list  of  processors.  The  processor  numbers 
in  the  send  list  are  used  successively  and  an  index 
into  the  node  numbers  being  sent  is  maintained.  For 
each  processor  in  the  list,  the  node  numbers  to  be 
sent  are  determined  by  taking  them  from  the  list 
of nodes starting at the index. Since the number
of nodes to be communicated to each processor is
stored, that many node numbers are used to take
information  from  the  array  to  be  sent  and  put  into 
a  message  array.  This  process  uses  all  of  the  data 
structure  as  illustrated  in  Figure  1  except  for  the 
array  of  indexes  into  the  list  of  nodes  to  be  com¬ 
municated.  The  message  array  is  then  sent  and  the 
index  is  incremented  by  the  number  of  nodes  which 
were  sent. 
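
The packing loop just described might look like the following sketch. The argument names are hypothetical, and the actual send call is replaced by collecting the messages in a list.

```python
def pack_messages(values, send_procs, send_counts, send_nodes):
    """Build one message per processor in the send list.

    values:      local array of a variable, indexed by local node number
    send_procs:  processors to send to, in send-list order
    send_counts: number of nodes sent to each of those processors
    send_nodes:  concatenated node numbers, grouped by processor
    """
    messages = []
    index = 0  # running position in send_nodes
    for proc, count in zip(send_procs, send_counts):
        nodes = send_nodes[index:index + count]
        messages.append((proc, [values[n] for n in nodes]))
        index += count  # advance by the number of nodes just sent
    return messages
```

Because the index is advanced as the list is walked, the stored array of starting indexes is not needed on the sending side, which matches the remark above.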

When  a  processor  receives  a  message,  it  looks  up 
the  processor  number  in  its  receive  array  and  the 
number  of  nodes  that  are  being  communicated  and 
the  starting  position  in  the  array.  It  uses  that  in¬ 
formation  to  put  the  values  in  the  message  into  the 
variable array in the right places. Communicating
partial values of the variables at a node is the inverse of
communicating the correct values, so the receive list
is used to send partial results to other processors
and the send list is then used to receive those results,
which are added to the local results to get the
correct value. In the case of communicating partial
values,  the  final  result  does  not  necessarily  need  to 
be  sent  to  the  other  processors  involved  since  they 
may  not  need  this  value. 

The  other  case  in  which  interprocessor  communi¬ 
cation  has  to  be  done  is  the  case  of  inner-products. 
This  is  done  by  the  standard  bidirectional  exchanges 
of  partial  information  along  successive  dimensions  of 
the  hypercube  with  the  addition  of  partial  results  af¬ 
ter  each  exchange  [6]. 
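
The exchange-and-add pattern for inner products can be simulated on a single machine as follows. This is a sketch of the general technique, not the code used in the paper; `global_sum` is a hypothetical name and the list index stands in for the processor number.

```python
def global_sum(partials):
    """Simulate bidirectional exchanges along successive hypercube
    dimensions; partials[r] is the partial inner product on processor r.
    Returns the value held by every processor after the last exchange."""
    p = len(partials)
    assert p > 0 and p & (p - 1) == 0, "processor count must be a power of two"
    values = list(partials)
    d = 1
    while d < p:
        # Each processor r swaps its value with neighbor r ^ d and adds.
        values = [values[r] + values[r ^ d] for r in range(p)]
        d <<= 1
    return values
```

After log2(p) exchanges every processor holds the full sum, so no separate broadcast of the result is needed.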

Input:  Large  Vectors 

After  the  node  processors  have  allocated  space  for 
the  vectors  that  they  store  and  have  set  up  their 
communication  schemes,  the  host  processor  can  send 
them the initial vector information (e.g., temperatures).
This information is sent to processor 0
which  then  broadcasts  it  to  the  other  node  proces¬ 
sors.  Each  processor  then  takes  the  part  of  the  vec¬ 
tor  which  it  needs  and  stores  it  in  its  memory.  The 
maximum  size  of  a  message,  the  size  of  the  message 
buffers  on  the  node  processors,  and  the  size  of  an  ar¬ 
ray  on  the  host  processor  are  each  limited,  so  large 
messages  have  to  be  read  in  to  the  host  and  sent 
to  the  node  processors  in  pieces.  After  each  piece 
of  the  message  is  received  by  the  node  processors, 
node  processor  0  sends  a  message  back  to  the  host 
to  allow  the  host  to  send  the  next  piece.  This  proce¬ 
dure  prevents  message  buffer  overflow  on  the  node 
processors. 
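
The piecewise transfer with acknowledgement can be sketched as a simple loop. The callbacks `send` and `wait_for_ack` are placeholders for the host's message routines, which the paper does not name.

```python
def send_in_pieces(vector, max_len, send, wait_for_ack):
    """Send a long vector in pieces of at most max_len entries,
    waiting for node 0 to acknowledge each piece before continuing."""
    for start in range(0, len(vector), max_len):
        send(vector[start:start + max_len])
        wait_for_ack()  # node 0 replies once all nodes have the piece
```

Waiting for the acknowledgement bounds the number of unconsumed pieces to one, which is what keeps the node message buffers from overflowing.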

Host Activity During Node Computation

At  this  point  the  node  processors  start  calculating 
and  the  host  processor  waits.  Since  the  node  proces¬ 
sors  have  to  output  results  and  read  additional  input 
such  as  the  temperature  of  the  nodes  at  the  begin¬ 
ning  of  each  load  step,  the  host  processor  has  to  be 
able  to  call  the  appropriate  subroutine  to  interact 
with  the  node  processors.  It  does  this  by  waiting 
to  receive  a  message  and,  based  on  the  type  of  the 
message  received,  either  calls  the  appropriate  sub¬ 
routine,  prints  out  the  appropriate  error  message, 
or  deallocates  the  hypercube  and  quits.  Since  node 
0 has a copy of any scalar data which has to be
communicated  back  to  the  host  processor  to  run  the 
subroutine,  this  information  is  included  in  the  mes¬ 
sage  which  tells  the  host  processor  which  subroutine 
to  run.  In  summary,  execution  is  controlled  by  the 
node  processors  in  this  part  of  the  calculation. 

Output 

The  output  from  the  node  processors  is  handled 
by  a  fanin  algorithm,  in  which  the  information  to  be 
output  is  sent  to  node  processor  0  which,  in  turn, 
sends  the  information  to  the  host.  The  fanin  algo¬ 
rithm  is  the  inverse  of  the  fanout  algorithm.  At  each 
stage,  half  of  the  active  processors  send  a  message 
to  the  other  half.  The  processors  which  receive  a 
message  are  the  active  processors  for  the  next  stage. 
As  with  input,  output  of  large  messages  is  also  done 
in  pieces.  In  order  to  output  arrays  in  the  proper 
order,  each  processor  has  a  list  of  the  global  order 
number  of  the  nodes  which  are  assigned  to  it.  Each 
piece  of  the  array  is  assembled  in  the  global  order 
and  sent  to  the  host  processor. 
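
A single-machine sketch of the fanin and the assembly in global order is given below. It assumes that the highest dimension is collapsed first (the reverse of a lowest-first fanout) and represents each processor's data as a dictionary keyed by global node number; both choices are illustrative.

```python
def fanin_schedule(dim):
    """Return, for each stage, the (sender, receiver) pairs of a
    gather toward node 0 on a hypercube of the given dimension."""
    stages = []
    for d in reversed(range(dim)):
        # The upper half of the active processors sends to the lower half.
        stages.append([(p | (1 << d), p) for p in range(1 << d)])
    return stages

def fanin_gather(local_data, dim):
    """local_data[p] maps global node number to value on processor p.
    Returns the values collected at node 0, in global order."""
    held = [dict(d) for d in local_data]
    for stage in fanin_schedule(dim):
        for sender, receiver in stage:
            held[receiver].update(held[sender])
    return [held[0][g] for g in sorted(held[0])]
```

Carrying the global order number with each value lets node 0 emit the array in the proper order no matter how the nodes were mapped.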

Problem  Mapping 

In  order  to  implement  JAC3D  on  the  hypercube, 
we  had  to  provide  for  the  automated  mapping  of 
large  problems  onto  the  hypercube.  We  have  used 
four  mapping  methods.  The  first  is  a  recursive  bisec¬ 
tion  method  developed  for  problems  on  rectangular 
grids  by  Berger  and  Bokhari  [1].  In  this  method,  the 
problem  grid  is  divided  into  two  rectangles  along  a 
line  of  the  grid.  This  division  is  repeated  recursively 
to  each  of  the  rectangles  until  the  desired  number  of 
sets  of  unknowns  is  created.  This  method  is  eas¬ 
ily  adapted  for  three-dimensional  rectangular  grids 
[3].  This  method  has  the  disadvantage  that  it  has 
the  potential  for  load  imbalance,  since  each  set  is 
divided  along  a  line  of  the  grid  and,  therefore,  the 
two  resulting  sets  may  not  be  the  same  size. 

From  this  algorithm,  we  have  developed  a  second 
algorithm which uses recursive bisection for irregular
regions in three dimensions. The first step is to
sort  the  nodes  of  the  grid  in  the  x,  y,  and  z  direc¬ 
tions.  At  each  stage  of  the  mapping,  a  direction  is 
chosen  and  each  set  in  the  mapping  is  divided  into 
two  equal  or  nearly  equal  sets  based  on  the  index  in 
the  sorted  list  for  the  given  direction  of  each  node 
in  the  set.  For  example,  given  a  set  S  with  n  nodes 
which is being divided into sets S1 and S2 along the
x direction, the first n/2 nodes of set S in the sorted
list of nodes for the x direction are placed in set S1
with  the  remainder  put  in  set  S2.  In  this  way,  the 
sets  at  the  final  stage  of  the  mapping  will  have  an 
approximately  equal  number  of  nodes. 
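
The bisection step can be sketched as follows. The paper does not say how the direction is chosen at each stage, so this example simply cycles through x, y, and z; that rule, and the name `recursive_bisection`, are assumptions.

```python
def recursive_bisection(coords, levels):
    """coords maps node number to (x, y, z).
    Returns {node: set number}, with 2**levels sets in total."""
    sets = [list(coords)]
    for level in range(levels):
        axis = level % 3  # assumption: cycle through x, y, z
        next_sets = []
        for s in sets:
            # Order the set by position along the chosen direction.
            ordered = sorted(s, key=lambda n: (coords[n][axis], n))
            half = len(ordered) // 2
            next_sets += [ordered[:half], ordered[half:]]  # S1, S2
        sets = next_sets
    return {n: i for i, s in enumerate(sets) for n in s}
```

Because each split is made by count rather than along a grid line, the two halves differ in size by at most one node.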

The third algorithm was developed by Kernighan
and Lin [8]. It is an iterative graph-based algorithm
which  starts  with  a  set  which  has  been  arbitrarily 
divided  into  two  equal  sized  pieces  and  exchanges 
nodes  in  order  to  minimize  the  number  of  edges  con¬ 
necting  the  two  pieces  of  the  set.  At  each  iteration, 
it  looks  at  all  of  the  unmarked  nodes  in  each  of  the 
two  pieces  of  the  set  and  marks  the  pair  which,  if  ex¬ 
changed,  would  minimize  the  number  of  edges  con¬ 
necting  the  two  pieces.  After  all  of  the  nodes  are 
marked,  then  the  minimum  number  of  pairs  to  cre¬ 
ate  the  maximum  change  are  exchanged.  The  pro¬ 
cess  is  repeated  until  nothing  further  can  be  gained 
by  swapping  nodes. 

The  fourth  algorithm  that  we  used  is  a  graph- 
based  algorithm  developed  by  Vaughan  [9].  At  each 
stage,  each  set  is  divided  into  two  equal  parts  by 
the  use  of  level  sets.  The  first  step  to  divide  a 
set  into  two  pieces  is  to  find  a  pseudo-diameter  of 
the  graph  of  the  grid  [5].  A  rooted  level  structure 
is  constructed  from  each  endpoint  of  the  pseudo¬ 
diameter.  The  nodes  are  divided  into  two  sets  ac¬ 
cording  to  which  endpoint  they  are  closer  to.  Each 
rooted  level  structure  will  have  a  set  of  level  sets  and 
the  number  of  the  level  set  a  node  is  in  is  a  measure 
of  its  distance  from  the  root  of  the  level  structure. 
Points  which  are  equidistant  from  both  endpoints 
are  assigned  to  a  set  so  that  the  sizes  of  the  sets  are 
equalized. 
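
One way to sketch the level-set division is shown below. It assumes a connected graph given as an adjacency dictionary with integer node labels, and it finds the pseudo-diameter with the common repeated breadth-first-search heuristic, which may differ from the method of [5]. Only the equidistant nodes are used for balancing here, so the halves are not guaranteed to be exactly equal.

```python
from collections import deque

def bfs_levels(adj, root):
    """Rooted level structure: distance of every node from root."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level

def pseudo_diameter(adj, start):
    """Move to a farthest node until the eccentricity stops growing."""
    a, la = start, bfs_levels(adj, start)
    while True:
        b = max(la, key=lambda n: (la[n], n))
        lb = bfs_levels(adj, b)
        if max(lb.values()) <= la[b]:
            return a, b
        a, la = b, lb

def level_set_bisect(adj):
    a, b = pseudo_diameter(adj, next(iter(adj)))
    la, lb = bfs_levels(adj, a), bfs_levels(adj, b)
    set_a = [n for n in adj if la[n] < lb[n]]
    set_b = [n for n in adj if la[n] > lb[n]]
    for n in adj:
        if la[n] == lb[n]:  # equidistant nodes go to the smaller set
            (set_a if len(set_a) <= len(set_b) else set_b).append(n)
    return set_a, set_b
```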

By  using  the  endpoints  of  a  pseudo-diameter  as 
starting  points,  we  seek  to  construct  level  structures 
with  small  level  sets  thus  providing  a  smaller  set 
of  nodes  on  the  boundary  when  the  set  is  divided 
into  two  pieces.  This  is  similar  to  the  motivation 
for  using  level  structures  in  reordering  equations  for 
solution  by  direct  methods. 

For the two graph-based algorithms, the number
of  sets  at  each  stage  of  the  division  is  doubled  from 
n  to  2n  and  the  sets  are  divided  according  to  their 
set  number  in  a  gray  code  fashion.  When  the  first 
set, set 0, is divided into two sets, these sets are
numbered 0 and n arbitrarily. After set i is divided,
with 0 < i < n, the two resulting sets are numbered i
and  i+n.  The  choice  of  which  set  is  to  be  numbered  i 
is  determined  by  which  numbering  gives  the  smallest 
cost  for  communication  with  the  sets  which  have 
already  been  divided. 

For  each  of  the  algorithms,  the  nodes  are  divided 
among  the  processors.  However,  with  our  solution 
method,  the  elements  also  have  to  be  mapped  to 
the processors. Each of the mappings above works
by doubling the number of processors in the mapping
at each stage. At each stage, half of the nodes
and half of the elements assigned to a processor are
assigned  to  a  new  processor.  Each  element  stays  in 
its  processor  or  moves  to  the  new  processor  based  on 
which  of  the  two  processors  has  more  of  its  nodes. 
Ties  are  settled  in  such  a  way  as  to  keep  the  number 
of  elements  assigned  to  the  two  processors  even. 
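
The element assignment rule can be sketched as a majority vote with balanced tie-breaking. The data layout and the name `split_elements` are assumptions; in particular, ties are resolved here by giving each tied element to whichever processor currently holds fewer elements.

```python
def split_elements(elements, element_nodes, node_side):
    """node_side[n] is 0 if node n stays on the old processor and 1 if it
    moves to the new one. Returns {element: side} by majority of nodes."""
    side = {}
    counts = [0, 0]
    ties = []
    for e in elements:
        moved = sum(node_side[n] for n in element_nodes[e])
        stayed = len(element_nodes[e]) - moved
        if moved == stayed:
            ties.append(e)  # decide later, once the clear cases are counted
        else:
            s = 1 if moved > stayed else 0
            side[e] = s
            counts[s] += 1
    for e in ties:
        s = 0 if counts[0] <= counts[1] else 1  # keep element counts even
        side[e] = s
        counts[s] += 1
    return side
```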

Results 

We  solved  two  problems  with  JAC3D  on  the  hy¬ 
percube.  The  first  is  a  rectilinear  block  with  three 
materials,  450  elements,  and  810  nodes.  The  second 
is  a  solder  analysis  problem  of  a  28  pin  integrated 
circuit  on  a  PC  board.  It  has  four  materials,  22932 
elements,  and  29681  nodes  and  is  very  irregular  (Fig¬ 
ure  2).  Since  we  are  solving  for  the  displacements  in 
three  directions,  there  are  89043  unknowns  in  this 
problem.  Symmetry  is  used  in  the  x  and  y  direc¬ 
tions  to  decrease  problem  size.  Note  that  most  of 
the  elements  and  nodes  are  in  the  pins  connecting 
the  PC  board  to  the  integrated  circuit. 

Table  1  shows  the  execution  times  for  the  pro¬ 
gram  on  the  first  problem  using  the  four  mapping 
methods  as  well  as  a  mapping  constructed  by  hand. 
The  problem  would  not  fit  on  one  processor,  or  even 
two  processors  in  the  case  of  the  Berger  and  Bokhari 
mapping.  In  the  tables,  hand  is  the  hand  mapping, 
graph  is  the  graph-based  method  by  Vaughan,  kl  is 
the  Kernighan  and  Lin  algorithm,  bb  is  the  Berger 
and  Bokhari  algorithm,  and  rb  is  the  recursive  bisec¬ 
tion  method  based  on  a  modification  of  the  Berger 
and  Bokhari  algorithm.  The  execution  times  only 
include  the  node  processor  time  and  do  not  include 
the  preprocessing  time  for  the  host.  In  the  best  case, 
we got a speedup of 41 on going from two to 256 processors.
This is encouraging considering that, on 256
processors,  each  processor  had  two  or  fewer  elements 
and  four  or  fewer  nodes. 

Figure 2. Solder Analysis Problem


Table 1. Execution time for small problem (seconds)

cube              Division Method
dim     hand    graph       kl       bb       rb
 1      1747     1751     1751        -     1752
 2       885      910      911     1003      909
 3       463      473      468      562      486
 4       252      254      258      313      271
 5       139      141      150      192      145
 6      86.8     84.1      108      117     85.3
 7         -     60.9     65.5     82.9     57.8
 8         -     45.9     48.0     55.8     42.9

Table 2 shows the time to construct the mappings
on a SUN 3. These times are smaller by a factor of
two or three than the time the division would take
on one node processor of the NCUBE/ten. The two
graph-based methods are slowest while the Berger
and Bokhari algorithm is the fastest. Note that a
large portion of time for the Kernighan and Lin
algorithm is spent in the first division.

Table  3  shows  the  execution  time  for  the  solder 
analysis  problem  on  the  NCUBE.  The  time  includes 
all  of  the  node  time  from  the  time  the  host  communi¬ 
cates  the  problem  to  the  nodes  and  does  not  include 
the  host  preprocessing  time.  The  Kernighan  and 
Lin  algorithm  produces  the  mapping  which  executes 
the  fastest  while  the  other  two  methods  are  about 
equal.  As  Table  4  shows,  however,  construction  of 
the  Kernighan  and  Lin  mapping  is  the  slowest  by  at 
least  a  factor  of  ten. 

Table 2. Mapping time for small problem (seconds on a SUN 3)

cube          Division Method
dim    graph       kl       bb       rb
 1       3.0     30.3      2.0      2.4
 2       4.3     40.8      2.0      2.5
 3       6.6     46.4      2.1      2.8
 4      10.4     52.5      2.1      3.1
 5      14.5     58.0      2.1      4.0
 6      21.5     66.3      2.2      5.4
 7      29.1     74.9      2.3      7.9
 8      40.0     84.0      2.4     14.3

Table 3. Execution time for large problem (seconds)

cube       Division Method
dim       kl    graph       rb
 8      8243     9098     9089
 9      5541     6217     6331
10      4312     4602     5144

Table 5 compares the solder analysis problem run
on both the NCUBE and the Cray X-MP. Here,
compute time for the NCUBE is just the node processor
time without any of the overhead of communicating
with the host between load steps, while the
total time is the time from start to finish on the
host. The total execution time for the NCUBE/ten
including all of the host time was 6100 seconds. This
shows that the processing time on the NCUBE/ten
is comparable to that on the Cray X-MP but the
I/O time which is a result of the host processor of
the NCUBE/ten causes the total execution time on
the NCUBE/ten to be much larger than that of the
Cray. When we implement this code on the NCUBE
2 with the SUN front end, the ratio of the total time
to compute time should improve dramatically.


Table 4. Mapping time for large problem (seconds on a SUN 3)

cube       Division Method
dim       kl    graph       rb
 8     49193     2995      684
 9     50072     3751     1242
10     50775     4894     2283

Table 5. NCUBE vs. Cray X-MP (seconds)

              Compute Time
NCUBE/ten         2197
Cray X-MP         1661


Discussion and Conclusions

We have implemented a large 3D finite element
code on the NCUBE/ten hypercube and have obtained
supercomputer-class performance (except for
host processor I/O). Compute times are within 20%
of the Cray X-MP for a production simulation with
89,043 equations. The NCUBE/ten can easily handle
a problem four times larger; such a simulation
would be faster on the NCUBE/ten relative to the
Cray. The hypercube code is complete: a user sees
the same user interface and simulation options on
the Cray and the hypercube.

We are now implementing this code on the
NCUBE 2 and its SUN front end. Preliminary
benchmarks indicate that the SUN front end reduces
I/O time by at least a factor of ten and that NCUBE
2 processors are currently a factor of four faster than
the first-generation processors. Therefore, the code
should run several times faster on the NCUBE 2
than on the Cray X-MP. We are also working on a
system to display JAC3D results from the NCUBE
2 on a Stellar graphics workstation.

Several promising methods have been implemented
and compared for mapping general problems
onto a hypercube. Clearly, the methods should be
judged by both the quality of their mappings and
the time it takes to do the mapping. We plan to
implement selected mapping algorithms, including
the simple graph method and the recursive bisection
method, in parallel on the NCUBE 2. We expect
that some of the mapping algorithms will parallelize
well and that the time used for mapping will
ultimately be a small part of the overall execution
time.

Abstract

This  paper  describes  initial  efforts  to  convert  the 
structural  analysis  program,  ABAQUS,  to  run  on  the 
iPSC/2. Efforts were limited to the main program
since  it  is  much  more  demanding  of  computational 
resources  than  either  the  pre-  or  post-processor.  The 
main  program,  in  turn,  can  be  viewed,  from  the 
perspective  of  parallel  processing,  as  consisting  of  two 
steps  with  separate  domain  decompositions:  the 
generation  of  submatrix  data  and  the  solution  of  a 
large  system  of  linear  equations  (typically,  out  of 
core). 

Parallel  generation  of  submatrix  data  was  achieved 
by  distributing  the  individual  finite  elements  among 
the  processing  nodes. 

The  equation  solver  used  by  ABAQUS  is  an 
implementation  of  the  wave-front  method.  The 
complex  nature  of  factorization  for  the  wave-front 
method  and  the  frequent  need  to  do  disk  I/O  dictated 
the  use  of  a  hybrid  decomposition  where  one  processor 
executed  non-numerical  operations  associated  with 
factorization  while  the  other  processors  assisted  the 
manager  by  performing  all  calculations  associated 
with  Gaussian  elimination. 

Introduction 

ABAQUS  [1]  is  a  large  commercial  finite  element 
code  extensively  used  for  structural  analysis.  It  is 
written  and  developed  for  sequential  computers  and 
presents  major  challenges  to  parallel  processors.  In 
addition  to  the  usual  problems  of  load  balancing  and 
minimization  of  communication  between  processors, 
a  conversion  of  ABAQUS  must  contend  with  an 
extensive  file  system  and  significant  disk  I/O.  It  is 


only  with  the  availability  of  the  Concurrent  I/O 
Facility  as  a  feature  of  the  iPSC/2  that  such  a 
conversion  could  be  considered. 

The  solution  of  a  structural  analysis  problem  with 
ABAQUS  is  typically  a  three-stage  process  with 
successive  execution  of  a  pre-processor,  main 
program  and  a  post-processor.  By  far  the  most 
demanding  in  terms  of  computational  resources  is 
the  main  program,  and  it  is  this  program  that  was 
modified  to  run  in  parallel. 

An initial examination of the main program
indicated that two separate decompositions would be
necessary to achieve efficient performance. The first
decomposition supports parallel generation of
submatrix  data,  while  the  second  is  needed  to  solve 
the  resulting  system  of  linear  equations.  Such  an 
approach  is  possible  since  matrix  generation  and 
matrix  solution  are  decoupled  processes  within 
ABAQUS. 

Parallel  ABAQUS  Assembly 

The  generation  of  element  stiffness  matrices  is  the 
most  intrinsically  parallel  part  of  any  finite  element 
code.  The  local  matrix  associated  with  each  element 
is  calculated  based  on  data  entirely  local  to  that 
element,  and  the  potential  parallelism  is  limited 
only  by  the  access  the  processor  has  to  the  data 
associated  with  a  given  element.  The  element  type, 
number  of  degrees  of  freedom,  material  data, 
element  connectivity  and  node  data  must  all  be 
known  to  the  processor  in  order  to  calculate  the 
element  stiffness  matrix  and  right-hand-side 
contribution. 

In ABAQUS these data are initially generated by the
pre-processor  and  transmitted  to  the  main  program 
through  a  communication  file.  Since  ABAQUS 
imposes  no  upper  limit  on  problem  size,  none  of  this 
information is kept entirely in data arrays in the
main  routine.  Rather,  the  data  for  a  particular  data 
base  starts  in  an  array  in  memory  and,  if  it  is  too 
large  to  fit  entirely,  spills  over  into  a  data  base  file. 
In  addition,  a  portion  of  memory  is  devoted  to  a  pool 
of  software  controlled  cache  pages  so  that  frequently- 
used  data  from  the  data  base  files  may  be  accessed 
with  minimum  latency. 

This  data  base  for  finite  element  analysis  has  been 
carefully  designed  for  performance  on  a  wide  variety 
of  single  memory  machines.  To  properly  parallelize 
ABAQUS,  the  domain  decomposition  requires  the 
partitioning  of  those  data  bases.  The  element  and 
element  operator  data  bases,  for  example,  are  readily 
split  so  that  each  processor  has  a  data  base  of  its  own 
elements.  The  node  data  base  is  more  complicated 
since  nodes  on  the  boundary  between  regions 
belonging  to  different  processors  must  be  shared. 
This  is  probably  best  implemented  by  assigning  two 
node  data  bases  to  each  processor,  one  for  nodes 
internal  to  that  processor’s  region,  and  one  for  shared 
nodes.  Shared  nodes  would  then  require  special 
processing  to  update  the  displacements  after  each 
iteration. 

For  the  effort  described  in  the  paper,  only  the 
element  and  element  operator  data  bases  have  been 
split.  All  others  are  replicated  on  each  processor, 
with  a  special  procedure  required  at  the  end  of  the 
run  to  bring  one  copy  of  the  node  data  base  up  to 
date. 

Parallel  Assembly  Performance 
Since  assembly  is  intrinsically  parallel,  linear  speed¬ 
up  is  expected  as  more  processors  are  employed.  This 
speed-up  should  be  reduced  only  by  contention  for  I/O 
bandwidth  in  reading  the  communication  file  and 
writing  to  the  element  operator  files.  Table  1  shows 
the  speed  ups  for  the  Submatrix  Generation  on  a 
very  large  statics  problem.  These  tests  are  run  on  an 
iPSC/2-SX  with  8  I/O  nodes  and  8  disks. 


Table 1 Performance Results for
Submatrix Generation


Even better speed up would be expected when the
data base files are properly decomposed as suggested
above. This improved decomposition will reduce I/O
traffic by reducing needless replication of data and
by better utilizing processor memory. Additional I/O
nodes would also improve performance.


Parallelization  of  the  Equation  Solver 

In  order  to  determine  the  feasibility  of  executing  the 
equation  solver  in  parallel,  only  a  subset  of  the 
software  from  ABAQUS  that  addresses  linear 
systems  was  modified.  In  particular,  attention  was 
limited  to  the  subroutine  that  factors  symmetric 
stiffness  matrices.  The  method  used  by  ABAQUS  for 
matrix  factorization  is  an  implementation  of  the 
wave-front  algorithm  [2].  An  initial  examination  of 
the  implementation  suggested  the  use  of  a  hybrid 
decomposition where:

1.  One  processor  (the  manager)  executes  the 
factorization  routine  except  for  numerically- 
intensive  calculations.  The  other  processors 
(the  workers)  then  assist  the  manager  by 
performing  the  numerically-intensive 
calculations.  Among  the  tasks  left  to  the 
manager  are  bookkeeping,  stability  checks 
and  disk  I/O. 

2.  The  allocation  of  work  among  the  workers  is 
via a domain decomposition of the coefficient
matrix  of  the  wave-front.  The  domain 
decomposition  is  the  standard  decomposition 
that  has  been  used  elsewhere  to  solve  systems 
of  linear  equations  on  the  iPSC/2  (e.g. 
LINPACK);  the  columns  of  the  coefficient 
matrix are distributed in a round robin
fashion  to  participating  processors. 

As  an  example  to  illustrate  the  allocation,  consider  a 
four  processor  system.  The  first  processor  (node  #0 
in  the  numbering  scheme  of  the  iPSC/2)  executes  the 
factorization  subroutine  except  for  numerically 
intensive  operations.  Node  #1  then  performs  all 
numerical  calculations  associated  with  columns  1,  4, 
7,  10...  of  the  coefficient  matrix,  node  #2  performs 
the  same  calculations  for  columns  2,  5,  8,  11...  and 
node  #3  performs  in  the  same  fashion  for  columns  3, 
6, 9, 12.... The result is two separate programs: a
manager  program  that  runs  on  node  0  and  a  worker 
program  that  runs  on  nodes  1, 2,  and  3. 
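
The round robin allocation in this example can be written down directly. The helper names are hypothetical; columns are numbered from 1 and workers from node #1, as in the text, with node #0 reserved for the manager.

```python
def columns_for_worker(worker, n_workers, n_columns):
    """Columns (1-based) handled by worker node `worker` (1..n_workers)."""
    return list(range(worker, n_columns + 1, n_workers))

def owner_of_column(col, n_workers):
    """Worker node that handles the 1-based column `col`."""
    return (col - 1) % n_workers + 1
```

For the four-processor case there are three workers, so `columns_for_worker(1, 3, 12)` gives columns 1, 4, 7, and 10, matching the allocation described for node #1.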

The  numerical  calculations  that  have  been 
transferred  from  the  manager  to  the  workers  are  all 
DO  loops  that  involve  two  specific  arrays.  The  first 
array  (GPA)  contains  the  coefficients  of  the 
equations  in  the  wave  front  while  the  second 
(BBAXO)  is  a  two-dimensional  array  of  all  equations 
that  become  fully  assembled  when  a  submatrix  is 
added  to  the  wave  front. 

The  division  of  work  and  transfer  of  data  between  the 
manager and worker programs are depicted in Fig. 1.
This  division  results  in  six  communication  points  in 
the  application,  of  which  four  involve  communication 
between  the  manager  and  worker  nodes. 

Results  for  Matrix  Factorization 

All  benchmarks  were  run  on  an  iPSC/2  with  SX 
nodes  and  8  MBytes  of  memory  per  node.  SX 
configurations  can  deliver  as  much  as  .5  MFLOPS 
per  processor  of  computational  power  in  double 
precision  floating  point  operations.  The  concurrent 
I/O  facility  consisted  of  two  I/O  nodes  with  2  disks  on 
each  node. 

Benchmarking  was  done  on  the  same  large  problem 
that  was  the  basis  of  the  results  that  were  presented 
for  the  generation  of  submatrix  data.  This  problem 
creates  a  linear  system  consisting  of  39000  equations 
with  half  bandwidth  of  440.  Execution  times  are 
presented  in  Table  2. 


# Processors    Time (sec.)    Speed up
      1            38070          1.0
      4            13639          2.8
      8             7183          5.3
     16             4368          8.1
     32             4412          8.6

Table 2  Performance Results for
Matrix Factorization


Concluding  Remarks 

Results  presented  herein  indicate  excellent  parallel 
performance  for  the  generation  of  submatrix  data 
and  acceptable  performance  for  matrix  factorization. 
With  the  availability  of  the  iPSC/860,  absolute 
performance  will  be  improved  significantly.  Some 
initial  results  bear  this  out.  Execution  of  submatrix 
generation  on  two  i860  nodes  improved  the 
applicable  results  of  Table  1  by  a  factor  of  11.  It  is 
anticipated  that  performance  for  matrix 
factorization  will  be  improved  by  a  factor  at  least  as 
large  as  this. 
for Spline Collocation Equations

Abstract

We study the parallel computation of linear second order
elliptic Partial Differential Equation (PDE) problems in
rectangular domains. We discuss the application of Conjugate
Gradient (CG) and Preconditioned Conjugate Gradient
(PCG) methods to the linear system arising from the
discretisation of such problems using quadratic splines
and the collocation discretisation methodology. Our
experiments show that the number of iterations required
for convergence of CG-QSC (Conjugate Gradient applied
to Quadratic Spline Collocation equations) grows linearly
with the square root of the number of equations. We
implemented the CG and PCG methods for the solution of
the Quadratic Spline Collocation (QSC) equations on the
iPSC/2 hypercube and present performance evaluation
results for configurations of up to 32 processors. Our
experiments show efficiencies of the order of 90% for both the
fixed and scaled speedups.

1.  Introduction. 

We study the parallel solution of the Partial Differential
Equation (PDE) problem

$Lu \equiv a D_x^2 u + b D_x D_y u + c D_y^2 u + d D_x u + e D_y u + f u = g$ in $\Omega \equiv (a_x, b_x) \times (a_y, b_y)$   (1.1)

$Bu \equiv \alpha u + \beta D_n u = g_0$ on $\partial\Omega \equiv$ boundary of $\Omega$   (1.2)

The input functions in the PDE problem (1.1)-(1.2) are
assumed to be functions of $x$ and $y$ in $C^1[\Omega]$, while $D_n$
denotes the normal derivative of $u$ on $\partial\Omega$.

In  this  paper  we  discuss  the  application  of  Conju¬ 
gate  Gradient  (CG)  and  Preconditioned  Conjugate  Gra¬ 
dient  (PCG)  methods  to  the  linear  system  arising  from  the 
discretisation  of  the  above  problem  using  quadratic 
splines  and  the  collocation  discretisation  methodology. 
The  fact  that  we  have  used  quadratic  splines  does  not 
limit  the  importance  of  our  results,  since  the  use  of  other 
degree  splines  gives  rise  to  linear  systems  with  similar 
properties  to  those  of  the  quadratic  spline  equations. 
Discretisation  methods  other  than  Spline  Collocation  (SC) 
are known to give rise to similar structure linear systems.

We implemented the CG and PCG methods for the
solution of the Quadratic Spline Collocation (QSC) equations
on the iPSC/2 hypercube and present performance
evaluation results for configurations of up to 32 processors.
The implementation can be extended in a straightforward way to
several other MIMD architectures, including linear arrays,
2-dimensional grids of processors, as well as shared
memory machines.

The methods for the parallel computation of PDEs
can be classified in 3 general groups: the domain decomposition
or substructuring methods, in which we assume
the decomposition of the domain of problem definition
into non-overlapping subdomains, the domain splitting
methods, in which the domain is decomposed into overlapping
subdomains, and those methods that directly use
the parallelism involved in the process of solving the
linear system arising from the discretisation of the PDE
problem.

In [Chri90a], [Chri89], [Chri88a] we have studied
domain decomposition methods for solving the above
PDE problem and presented the results from their
implementation on the iPSC/2, NCUBE/7 and
SEQUENT BALANCE 21000 parallel machines. In
[Hous88b] domain splitting methods are integrated with
cubic spline collocation and implemented on the
NCUBE/7 hypercube. This paper falls in the third
category of methods for the parallel computation of
PDEs.

Many researchers have studied the convergence
and/or parallel implementation of CG and PCG methods
applied to systems arising from the discretisation of elliptic
problems by other Finite Element Methods (FEMs), or
to the Schur complement systems arising from domain
decomposition methods and appropriate reordering
[Keye87], [Dryj84], [Bram86], [Bjor86], [Dryj86]. Others
[Rodr86], [Tang87] experiment with domain splitting
methods. The study of the solution of SC equations is
limited due to the fact that the development of optimal
schemes for two-dimensional problems is very recent
[Hous87], [Irod87], [Chri88b], and due to the lack of
some nice properties, such as symmetry and positive
definiteness, which are often standard properties for other
FEM equations. This paper is the first successful study of
the application of CG methods to SC equations.
2. The Quadratic Spline Collocation method.

Spline collocation methods have been proven an
efficient alternative for solving elliptic PDEs [Hous88a].
The general formulation of these methods for the discretisation
of (1.1)-(1.2) was briefly presented in [Chri89] so
we do not include it here. We include though for later
reference the formulation of the QSC method for the
discretisation of (1.1)-(1.2) in the case that $\alpha\beta = 0$, i.e. the
boundary operator is either Dirichlet or Neumann. We
assume a uniform rectangular mesh $\Delta \equiv \{(x_i, y_j): i = 0$ to $n$
and $j = 0$ to $m\}$ in $\Omega$, on which we define a tensor product
of one-dimensional quadratic splines

$S_{2,\Delta} \equiv S_{2,\Delta_x} \otimes S_{2,\Delta_y} \subset P_{2,\Delta} \cap C^1(\Omega)$

with $P_{2,\Delta}$ denoting the space of piecewise biquadratic
polynomials with respect to $\Delta$. The one-dimensional quadratic
splines are constructed so that the boundary operator
equation (1.2) is satisfied exactly at any point on $\partial\Omega$.

We define the set of collocation points $T_\Delta$ to be the
set of midpoints of all subrectangles of $\Delta$. Note that all
the collocation points lie in the interior of $\Omega$. We determine
the quadratic spline approximation $u_\Delta \in S_{2,\Delta}$ to $u$ in
two steps by the following equations:

Step 1: $Lv = g$ on $T_\Delta$   (2.1)

Step 2: $Lu_\Delta = g - P_L v$ on $T_\Delta$   (2.2)

where $P_L$ is an appropriate perturbation operator, defined in
[Chri88b], [Chri90b]. The first step solution $v$ is a second
order approximation to $u$ and $u_\Delta$ is a fourth order one.

The QSC equations (2.1) or (2.2) form a block tridiagonal
linear system of $nm$ equations. If we assume that
the ordering of the collocation points is bottom-up and
then left-to-right, every block is of order $m$, the upper and
lower bandwidth is $m+1$ and there are $n$ blocks on the
diagonal. Figure 2.1 shows the pattern of non-zero
entries in the QSC matrix.


Figure  2.1.  Structure  of  the  matrix  of  QSC  equations  for 
n  =m  =  l.  X  denotes  a  non-zero  off-diagonal 
element, d a non-zero diagonal one, while all
zero  entries  are  represented  by  character 


3. The Preconditioned Conjugate Gradient (PCG)
method for solving linear systems.

In this section our aim is to recall some issues in the
parallel implementation of the PCG method. For later
reference we include here the steps of the PCG algorithm
for solving a linear system $Ax = b$ with preconditioner $M$,
as described in [Golu87]. The superscripts on vectors or
scalars denote the iteration number of the algorithm at
which the vectors or scalars are computed.


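PCG algorithm for $Ax = b$

1.  $x^0$ = initial guess

2.  $r^0 = b - Ax^0$
    for $k = 1, \ldots,$ maxit

3.      if $\|r^{k-1}\| < \epsilon$ (or $\|r^{k-1}\| < \|r^0\|\,\epsilon$) exit
        else

4.      solve $Mz^{k-1} = r^{k-1}$

5.      $\beta^k = z^{k-1} \cdot r^{k-1} / z^{k-2} \cdot r^{k-2}$   ($\beta^1 = 0$)

6.      $p^k = z^{k-1} + \beta^k p^{k-1}$   ($p^1 = z^0$)

7.      $\alpha^k = z^{k-1} \cdot r^{k-1} / p^k \cdot Ap^k$

8.      $x^k = x^{k-1} + \alpha^k p^k$

9.      $r^k = r^{k-1} - \alpha^k Ap^k$

10.     endif
    endfor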
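
The steps of the PCG algorithm can be illustrated with a short serial sketch. It assumes a small dense symmetric positive definite matrix stored as a list of rows, a user-supplied function `solve_M` that returns $z$ with $Mz = r$, and the relative form of the stopping test in line 3 (written with `<=` so that a zero initial residual exits immediately). The names `pcg`, `dot`, `matvec` and `solve_M` are illustrative choices and do not come from the original implementation.

```python
import math


def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    """Product of a dense matrix (list of rows) with a vector."""
    return [dot(row, x) for row in A]


def pcg(A, b, solve_M, x0=None, eps=1e-10, maxit=1000):
    """Preconditioned CG for a symmetric positive definite A.

    Returns the approximate solution and the number of iterations performed.
    """
    n = len(b)
    x = list(x0) if x0 is not None else [0.0] * n            # line 1
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]          # line 2
    r0_norm = math.sqrt(dot(r, r))
    p = None
    rho_old = None
    for k in range(1, maxit + 1):
        if math.sqrt(dot(r, r)) <= eps * r0_norm:             # line 3 (relative test)
            return x, k - 1
        z = solve_M(r)                                        # line 4
        rho = dot(z, r)
        if p is None:
            p = list(z)                                       # first search direction
        else:
            beta = rho / rho_old                              # line 5
            p = [zi + beta * pi for zi, pi in zip(z, p)]      # line 6
        Ap = matvec(A, p)
        alpha = rho / dot(p, Ap)                              # line 7
        x = [xi + alpha * pi for xi, pi in zip(x, p)]         # line 8
        r = [ri - alpha * api for ri, api in zip(r, Ap)]      # line 9
        rho_old = rho
    return x, maxit


if __name__ == "__main__":
    # 1-D model matrix tridiag(-1, 2, -1) with a diagonal (Jacobi) preconditioner.
    size = 5
    A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
          for j in range(size)] for i in range(size)]
    b = matvec(A, [1.0] * size)
    x, iters = pcg(A, b, lambda r: [ri / 2.0 for ri in r])
    print(iters, x)
```

Passing `solve_M=lambda r: list(r)` gives the unpreconditioned CG method, in which case the inner product `dot(z, r)` coincides with the squared Euclidean norm of the residual.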


The computational requirements of every PCG
iteration are discussed in detail in [Orte88]. From the
above it is clear that the parallel implementation of the
PCG method depends very much on the implementation
of the linked triad operation (scalar-vector multiplication
and vector addition), the inner product operation, the
matrix-vector multiplication, the back-and-forward substitutions
and the computation of the norm. There are
numerous ways to implement the above operations on a
parallel machine [Orte88]. They mainly reflect the assignment
of the elements of the matrices A and M and the vectors
r, z, p and x to the processors.

The CG method without preconditioner follows
similar steps, with the exception that M is assumed to be
the identity matrix, so the back-and-forward substitutions
are avoided. Also, the computation of an inner product
can be avoided when using the Euclidean norm in line 3
of the algorithm.


4.  The  CG  and  PCG  methods  for  the  QSC  equations. 

We first experimented with the convergence of the
CG iterations applied to the QSC equations. The results
show that the number of CG iterations required to satisfy
a stopping criterion as in line 3 of the PCG algorithm
grows linearly with the square root of the order of the system.
In the case where $n = m$ the order of the system is
$O(n^2)$, so the number of iterations grows linearly with $n$.
Table 4.1 shows the number of iterations required for
the CG-QSC method (CG method applied to QSC equations)
when applied to the problem

$u_{xx} + u_{yy} = f$ in $\Omega \equiv (0,1) \times (0,1)$   (4.1a)

$u = g$ on $\partial\Omega$.   (4.1b)

f and g are chosen so that the solution to the problem is
u(x,y) = x*^^(x-1)y*^(y-1). The initial guess that was
used for step 1 of the QSC method was the zero vector,
while in step 2 we use the already computed solution vector
from step 1. The relative Euclidean norm of the residual
was used for the stopping criterion in line 3 of the
PCG  algorithm.  The  desired  precision  e  was  set  to  10"^. 
Figure  4.1  shows  graphically  the  data  of  Table  4.1. 

Table 4.1. Number of iterations required for the convergence
of the CG method applied to the QSC
equations (2.1)-(2.2) for several grid sizes.

grid size    number of     number of iterations
   n+1       equations      step 1     step 2
     5            16            5          5
     9            64           11         10
    17           256           24         19
    25           576           36         26
    33          1024           49         34
    41          1600           63         42
    49          2304           77         50
    57          3136           91         58
    65          4096          105         66

It is interesting to note that the number of iterations
required for convergence of step 2 is exactly $n+1$ (or
$n+2$) while the slope of the number of iterations curve
required for convergence of step 1 is about 1.6875.


Figure 4.1. Plot of the number of iterations required for
the  convergence  of  the  CG  method  when  ap¬ 
plied  to  QSC  equations  for  Problem  4.1 
versus  the  grid  size  in  one  dimension,  for 
both  steps  of  the  QSC  method. 

In an attempt to explain why the number of iterations
required for the convergence of the CG-QSC
method grows proportionally with $n$, we consider a
Helmholtz problem ($b = d = e = 0$) with constant
coefficients ($a$, $c$, $f$ constants) and Dirichlet boundary
conditions ($\beta = 0$) and state a theorem which is proved in
[Chri90b].

Theorem. Under the assumptions that $a, c > 0$ and
$\pi^2 \left( \frac{a}{(b_x - a_x)^2} + \frac{c}{(b_y - a_y)^2} \right) > f$, the spectral norm of
the inverse of the matrix of QSC equations in the case of a
Helmholtz problem with constant coefficients and Dirichlet
boundary conditions is bounded, as $n \to \infty$, $m \to \infty$.

A similar theorem holds in the case of Neumann boundary
conditions. Taking into account that the norm of the
matrix of QSC equations grows proportionally with $n^2$,
we conclude that the condition number of the matrix of
QSC equations also grows proportionally with $n^2$. For a
symmetric positive definite system $Ax = b$ we know
[Axel84] that the number of CG iterations required for
convergence grows proportionally with the square root of
the condition number of $A$. For the case of a Helmholtz
problem with constant coefficients and Dirichlet boundary
conditions the matrix of QSC equations is symmetric and
positive definite, so the number of iterations required for
the convergence of the CG-QSC method grows proportionally
with $n$. Figure 4.2 shows the behaviour of the
residual of the QSC system as the CG iterations proceed.
It is interesting to note that for PDE problems other than
the Helmholtz problem, for which the QSC equations are
not symmetric, we have successfully applied the CG
method and its asymptotic behaviour was not extremely
different from the one for Problem 4.1.
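
The chain of reasoning above can be summarised in one line. The notation $\kappa_2$ for the spectral condition number and $k_{\mathrm{CG}}$ for the iteration count is introduced here for illustration only, and the case $n = m$ is assumed:

```latex
\kappa_2(A) \;=\; \|A\|_2 \,\|A^{-1}\|_2 \;=\; O(n^2)\cdot O(1) \;=\; O(n^2),
\qquad
k_{\mathrm{CG}} \;\propto\; \sqrt{\kappa_2(A)} \;=\; O(n).
```

The first factor comes from the growth of the norm of the QSC matrix, the second from the theorem bounding the norm of its inverse.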


Figure  4.2.  Plot  of  the  residual  of  the  QSC  system  versus 
the  iteration  number  of  the  CG  algorithm  for 
Problem  4.1,  for  several  grid  sizes  and  for 
both  steps  of  the  QSC  method.  The  residual 
is  in  log  scale. 

We also experimented with the performance of the
CG-QSC method as compared with solving the QSC
equations with standard banded LU factorisation (Band-LU).
Figure 4.3 shows the results graphically. The slope
of the curve of the solution time versus the grid size
corresponding to the CG-QSC method is clearly lower
than the one for Band-LU. This agrees with the theoretically
expected performance of the two methods, since
Band-LU is $O(n^4)$, while CG-QSC is $O(n^3)$, where we
have again assumed that $n = m$. Note that this holds for
step 1 of the QSC method. For step 2 both Band-LU and
CG methods are $O(n^3)$, assuming the factorisation of the
matrix of QSC equations is saved from step 1.
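
These operation counts follow from standard estimates, sketched below under the assumptions that $n = m$, that the band solver exploits only the bandwidth $w$ (not the sparsity inside the band), and that one CG iteration costs a number of operations proportional to the number of unknowns $N$ because each row of the sparse matrix has a bounded number of non-zero entries. The symbols $N$ and $w$ are introduced here for illustration, and the `align*` environment assumes the amsmath package.

```latex
\begin{align*}
\text{unknowns and bandwidth:} \quad & N = nm = n^2, \qquad w = m + 1 = O(n) \\
\text{Band-LU factorisation:} \quad & O(N w^2) = O(n^4) \\
\text{Band-LU triangular solves:} \quad & O(N w) = O(n^3) \\
\text{CG-QSC:} \quad & O(n) \text{ iterations} \times O(N) \text{ per iteration} = O(n^3)
\end{align*}
```

In step 2 the saved factorisation leaves only the triangular solves for Band-LU, which is why both methods are then of the same order.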

The  CG  method  is  not  the  only  iterative  solver  that 
is  faster  than  direct  band  solvers  for  the  QSC  equations. 
In  [Hous88a]  we  experiment  with  several  iterative 
solvers,  that  outperform  the  direct  ones  in  both  time  and 
memory  requirements. 


Figure 4.3. Log-log scale plot of the time in milliseconds
for  the  solution  of  step  1  of  the  QSC  equa¬ 
tions  with  the  CG  and  Band-LU  methods 
versus  the  grid  size  in  one  dimension.  The 
computation  was  carried  out  on  one  proces¬ 
sor  of  the  iPSC/2  hypercube. 

For matrices of block tridiagonal structure it is
quite common to choose as preconditioner the tridiagonal
part of the original matrix, in order to accelerate the
convergence rate of the CG method. In the case of QSC
equations the tridiagonal part T of the matrix is also
block-diagonal. When the CG method is applied to the
QSC equations arising from the discretisation of a PDE
problem with operator other than the Laplace operator,
our experiments show that using T as preconditioner
accelerated the convergence of the CG method. In
[Chri90c] we study the construction of appropriate
preconditioners for the QSC equations. In the rest of the
paper any reference to the CG-QSC method will assume
that an appropriate preconditioner is used whenever
necessary.


5. Implementation of the CG-QSC method on hypercube
architectures.

In this section we discuss in more detail how the
computation involved in the CG-QSC method is mapped
on hypercube architectures. Although we limit this
discussion to a specific MIMD architecture, most of the
ideas presented can be implemented in a straightforward way on other
types of local memory machines as well as shared memory
ones.

5.1.  Distribution  of  the  data  to  processors. 

The first step in the implementation of an algorithm,
in which certain parallelism is identified, on a
specific local memory machine, is to distribute the data in
the local memory of the processors, so that communication
is "minimised" and as little data as possible is duplicated.
In the case of the CG-QSC method the distribution of
data is motivated by the parallel implementation of the
individual steps of the PCG algorithm as described in
Section 3. Although this distribution of data holds only
for local memory machines, in the case of shared memory
machines this distribution reflects the way the processors
are going to address the shared memory.

The  matrix  A  of  QSC  equations  is  stored  by  rows  in 
a  sparse  matrix  storage  scheme,  so  that  only  the  non-zero 
elements  of  every  row  are  stored.  Every  processor  holds 
the  rows  corresponding  to  one  or  more  blocks  in  the  block 
notation  of  the  matrix.  For  simplicity  we  assume  the 
number of processors P divides n exactly. So every processor
will store $nm/P$ equations. Also every processor will
store the respective rows of the vectors r, z and x. As far
as  the  direction  vector  p  is  concerned  a  processor  will 
update  those  components  corresponding  to  the  rows  of  A 
it  stores,  but  will  have  storage  for  the  "neighbouring" 
components,  more  specifically  m  positions  on  the  top  of 
the  part  it  is  going  to  update  and  m  positions  at  the  bot¬ 
tom. 
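
The row distribution just described can be sketched as follows. The sketch assumes 0-based row indices, half-open index ranges, and that processor `rank` holds the `rank`-th group of `n/P` consecutive blocks; the function names are hypothetical. The range of stored components of p is clipped at the ends of the vector, so the first and last processors have a neighbouring region on one side only.

```python
def owned_rows(rank, n, m, P):
    """Half-open range [start, stop) of rows of A held by processor `rank`.

    The matrix has n diagonal blocks of order m, and P must divide n.
    """
    if n % P != 0:
        raise ValueError("P must divide n")
    rows_per_proc = (n // P) * m          # n*m/P equations per processor
    start = rank * rows_per_proc
    return start, start + rows_per_proc


def stored_p_range(rank, n, m, P):
    """Half-open range of components of p stored locally.

    This is the owned range extended by m positions on the top and m at the
    bottom, clipped to the length n*m of the vector.
    """
    start, stop = owned_rows(rank, n, m, P)
    return max(0, start - m), min(n * m, stop + m)


if __name__ == "__main__":
    n = m = 8
    P = 4
    for rank in range(P):
        print(rank, owned_rows(rank, n, m, P), stored_p_range(rank, n, m, P))
```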

5.2.  Parallel  discretisation  of  the  PDE  problem. 

The  discretisation  process  of  a  PDE  problem  with 
the collocation methodology is by definition pointwise, so
it  is  totally  asynchronous,  assuming  a  distribution  of  the 
collocation  points  to  the  processors.  In  our  case  of  QSC 
with  the  midpoints  as  collocation  points,  and  a  bottom-up 
left-to-right  numbering  of  them,  the  parallel  generation  of 
the  matrix  A  can  be  viewed  as  a  line  collocation  method. 
Every  processor  generates  the  rows  of  A  it  is  assigned  to, 
that  is,  the  equations  corresponding  to  one  or  more  verti¬ 
cal  grid  lines,  with  no  need  to  communicate  with  any 
other  processor. 
5.3. Computing the product of A by a vector.

According to the above assignment of the elements
of A and p to the processors, a processor will compute the
inner product of the rows of A it holds with the vector p.
Due to the block-tridiagonal structure of the matrix A any
processor needs to receive at most 2m components of p from
other processors; the rest already reside in the local
memory of the processor. This fact has two nice effects:
First,  that  only  neighbour  communication  is  necessary,  if 
we  assume  that  the  assignment  of  blocks  of  rows  of  A  to 
the  processors  is  done  according  to  the  standard  gray  code 
ordering  of  the  processors.  Second,  that  the  amount  of 
data  transfer  per  processor  does  not  grow  with  the  number 
of  processors.  It  only  grows  with  m,  the  grid  size  in  one 
direction.  This  helps  so  that  the  speedup  does  not  degrade 
much  as  the  number  of  processors  increases. 
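
The role of the Gray code ordering can be illustrated with a short sketch. It assumes the binary reflected Gray code and hypercube nodes numbered 0 to P-1, with two nodes being physical neighbours exactly when their numbers differ in one bit; the function names are illustrative. Consecutive row blocks are then always held by directly connected nodes.

```python
def gray(i):
    """Binary reflected Gray code of the non-negative integer i."""
    return i ^ (i >> 1)


def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def block_to_node(P):
    """Hypercube node holding the k-th group of row blocks, k = 0..P-1."""
    return [gray(k) for k in range(P)]


if __name__ == "__main__":
    nodes = block_to_node(8)
    print(nodes)
    print([hamming_distance(nodes[k], nodes[k + 1]) for k in range(7)])
```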

5.4. Preconditioning.

The  assignment  of  the  blocks  of  T  to  the  processors 
is  similar  to  that  of  A.  Since  T  is  block  diagonal,  every 
processor  can  work  independently  for  the  back-and- 
forward  substitutions  of  the  blocks  of  T  it  is  assigned  to. 
So  preconditioning  with  T  does  not  increase  the  commun¬ 
ication  overhead. 

5.5.  Computation  of  inner  product  of  vectors  and 
norms. 

For  the  inner  products  in  lines  5  and  7  of  the  PCG 
algorithm  the  well  known  fan-in  technique  is  used.  More 
specifically,  we  use  global  fan-in  so  that  the  final  result 
resides  in  all  processors,  instead  of  a  fan-in  in  one  proces¬ 
sor  and  a  fan-out  broadcast  of  the  final  result  to  all  other 
processors. The parallel computation of the norm of the
vector  depends  on  the  norm  used.  For  the  infinity  norm  a 
global  fan-in  comparison  scheme  is  used,  while  for  the 
Euclidean  norm  a  global  fan-in  summation. 
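
A global fan-in of this kind can be simulated serially. The sketch below assumes a dimension-by-dimension pairwise exchange, in which node r combines its running value with that of node r XOR 2^d at stage d; this exchange pattern and the function name are assumptions made for illustration, not a description of the actual iPSC/2 code. After log2(P) stages every node holds the full result, with addition used for inner products and the Euclidean norm, and `max` for the infinity norm.

```python
def global_fan_in(local_values, combine=lambda a, b: a + b):
    """Simulate a global fan-in on a hypercube with P = 2**d nodes.

    local_values[r] is the partial result held by node r. Returns the list of
    values held by the nodes at the end and the number of exchange stages.
    """
    P = len(local_values)
    if P == 0 or P & (P - 1):
        raise ValueError("number of nodes must be a power of two")
    values = list(local_values)
    stages = 0
    bit = 1
    while bit < P:
        # Each node combines its value with the one received from its
        # neighbour along the current hypercube dimension.
        values = [combine(values[r], values[r ^ bit]) for r in range(P)]
        bit <<= 1
        stages += 1
    return values, stages


if __name__ == "__main__":
    partial_sums = [1, 2, 3, 4, 5, 6, 7, 8]
    print(global_fan_in(partial_sums))
    print(global_fan_in(partial_sums, combine=max))
```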

6.  Performance  results. 

In  this  section  we  discuss  the  performance  of  the 
CG-QSC  method  on  various  processor  configurations  of 
the  iPSC/2  hypercube.  We  refer  to  the  basic  computa¬ 
tional  constructs  of  the  CG-QSC  method  as  discretisation 
process,  solution  process  and  per  iteration  process.  The 
per  iteration  process  includes  the  computation  of  one 
(P)CG iteration, while the solution process includes the
factorisation  of  the  preconditioner  (if  there  is  one)  and  the 
computation  of  all  iterations. 

6.1.  Speedup  and  efficiency. 

We first measure the so-called scaled speedup
[Gust88], [Orte88]. According to the definition of scaled
speedup we need to choose a different size problem for
each processor configuration. For the solution process of
the CG-QSC method, the problem size, that is, the operation
count, is $O(n^3)$, while for each iteration it is $O(n^2)$,
as it is for the discretisation process. More specifically,
we scale the problem size as follows: Let $n$ be the number
of grid points in one dimension, for which we let the CG-QSC
program run on a single processor. We then
choose $n_P$ to be such that $n_P^3 = P n^3$ and let the CG-QSC
program run on $P$ processors for a grid size $n_P$. Then
the scaled speedup for the (solution process of the) CG-QSC
algorithm is $\frac{t_1(n)}{t_P(n_P)} P$, where $t_i(j)$ is the time
elapsed for the execution of the program on $i$ processors
and grid size $j$ in one dimension. Similarly for the
discretisation process and the per iteration computation we
choose $n_P$ such that $n_P^2 = P n^2$. Alternatively, we can
compute the scaled speedup for 2 different grid sizes $n$
and $n_P$ as $\frac{t_1(n)}{t_P(n_P)} \frac{n_P^3}{n^3}$ for the solution process of the
CG-QSC algorithm and as $\frac{t_1(n)}{t_P(n_P)} \frac{n_P^2}{n^2}$ for the
discretisation and per iteration processes.


In  Figure  6.1  we  plot  the  estimated  scaled  speedup 
for  the  discretisation,  solution  and  per  iteration  processes 
of  the  CG-QSC  algorithm.  The  grid  sizes  for  this  plot 
vary  from  25  for  a  single  processor  to  97  for  32  proces¬ 
sors.  The  discretisation  process  does  not  suffer  from  any 
communication  overhead  and  the  slight  degradation  of  the 
speedup away from the linear one is due to duplicate
computations  done  in  all  processors,  in  order  to  initialise 
certain  parameters  of  the  problem,  as  well  as  to  a  few 
differences  in  the  code  for  a  single  processor  from  that  for 
multiple  processors.  The  speedup  curves  for  the  solution 
and  per  iteration  processes  look  very  similar.  The  degra¬ 
dation  of  speedup  in  these  cases  is  mainly  due  to  com¬ 
munication  overhead  as  well  as  synchronisation  and  load 
balancing.  We  would  like  to  point  out  that  the  CG-QSC 
algorithm  is  perfectly  load  balanced  as  far  as  computation 
is  concerned.  Communication  is  also  well  load  balanced 
with  the  exception  of  the  nearest  neighbour  communica¬ 
tion,  that  is  required  for  the  computation  of  the  matrix- 
vector  product  Ap,  in  which  the  first  and  last  processors 
remain  idle,  during  the  time  the  others  exchange  m  com¬ 
ponents  of  a  vector.  Also,  the  unreliability  of  the 
hardware  might  cause  some  load  imbalance. 


Based  on  the  speedups  plotted  in  Figure  6.1  the 
efficiency  of  the  discretisation  process  ranges  from  90% 
to 98%, while the efficiency of the solution and per iteration
processes ranges from 79% to 93%.
Figure 6.1. Measured speedup for the discretisation,
solution and per iteration times of CG-QSC
on the iPSC/2 for configurations of up to
32 processors.

We next plot the fixed speedup for grid sizes 97 and
65 in Figures 6.2 and 6.3 respectively. This turns out to
be better than the scaled one for a small number of processors,
but degrades faster for a large number of processors.
This comes from the fact that the fixed speedup suffers
from the overhead of carrying out a small amount of computation
in each processor. It is clear that the slope of the
fixed speedup curve for a large number of processors is
lower than that of the scaled one, and that the 65 grid size
speedup is worse than the 97 grid size one.


Figure 6.3. Measured speedup for the discretisation,
solution  and  per  iteration  times  of  CG-QSC 
on  the  iPSC/2  for  up  to  32  processors 
configurations  and  fixed  65x65  grid. 

It is interesting to note that the efficiencies of the
processors based on the fixed speedups are 100% for 2
processors for both the grid sizes shown in the two
figures. More specifically, the efficiency based on fixed
speedup varies from 94% (91%) to 100% for the discretisation
process and grid size 97 (65), and from 83% (69%)
to 100% for the solution and per iteration processes for
the same grid size(s).
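
The lower ends of these efficiency ranges can be recomputed from the CG-QSC timings reported in Table 6.1 for P = 1 and P = 32. The sketch below assumes the usual definitions, fixed speedup $t_1(n)/t_P(n)$ and efficiency equal to speedup divided by P; the dictionary layout and function names are illustrative.

```python
# (t_1, t_32) in milliseconds, taken from Table 6.1 (CG-QSC rows, P = 1 and P = 32).
times = {
    ("discretisation", 97): (132504, 4400),
    ("discretisation", 65): (59020, 2024),
    ("solution", 97): (349274, 13217),
    ("solution", 65): (110876, 5022),
}


def fixed_speedup(t1, tP):
    """Speedup for a fixed problem size: single-processor time over P-processor time."""
    return t1 / tP


def efficiency(t1, tP, P):
    """Fixed speedup divided by the number of processors."""
    return fixed_speedup(t1, tP) / P


if __name__ == "__main__":
    for (process, grid), (t1, t32) in times.items():
        print(process, grid, round(100 * efficiency(t1, t32, 32)), "%")
```

Rounded to whole percentages, these four values agree with the figures of 94%, 91%, 83% and 69% quoted above.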

In  Figure  6.3  we  also  plot  the  "speedup"  of  the 
CG-QSC  solution  process  with  respect  to  the  Band-LU 
solution  process  carried  out  in  a  single  processor.  This  is 
clearly  superlinear,  due  to  the  merits  of  the  CG-QSC 
method.  We  were  unable  to  run  the  Band-LU  algorithm 
for  larger  than  65  grid  sizes,  due  to  the  limit  in  the  local 
memory  of  a  processor. 

Finally, in Table 6.1 we include some of the numerical
ical  data  that  was  used  to  draw  Figures  6.1,  6.2  and  6.3. 
This  table  has  also  a  column  for  the  memory  requirements 
of  the  two  methods/solvers.  The  memory  requirements  of 
the  CG-QSC  method  are  far  less  than  those  for  Band-LU. 
The  memory  requirements  of  CG-QSC  decrease  almost 
linearly  with  the  number  of  processors. 


Figure 6.2. Measured speedup for the discretisation,
solution  and  per  iteration  times  of  CG-QSC 
on  the  iPSC/2  for  up  to  32  processors 
configurations  and  fixed  97x97  grid. 
Table 6.1. Time in milliseconds on the iPSC/2 for the
discretisation, solution and iteration
processes of the Band-LU factorisation and
CG-QSC algorithms for various grid sizes
and processor configurations. A "-" means
not applicable. The memory requirements
are in floating-point numbers.

method     n+1   discret.   solution   per iteration   memory    P
Band-LU     25      8460       5048          -           35712    1
CG-QSC              8446       5716        141.30        10944    1
                    4234       2958         73.10         5496    2
                    2156       1664         41.20         2784    4
                    1120       1028         25.50         1416    8
Band-LU     33     14952      14228          -           79872    1
CG-QSC             14948      13762        252.70        19456    1
                    7492       7040        129.26         9760    2
                    3792       3827         70.27         4930    4
                    1952       2222         40.86         2498    8
                    1020       1452         26.74         1282   16
                     560       1130         20.78          674   32
Band-LU     41     23268      32960          -          150400    1
CG-QSC              5892       7164        106.27         7680    4
Band-LU     49     33360      65464          -          253440    1
CG-QSC              4296       6617         81.25         5568    8
Band-LU     65     59028     197712          -          581632    1
CG-QSC             59020     110876       1022.37        77824    1
                   29560      55748        514.06        38976    2
                   14876      28425        262.12        19584    4
                    7556      14926        137.65         9856    8
                    3848       8210         75.76         4992   16
                    2024       5022         46.35         2560   32
CG-QSC      97    132504     349274       2306.25       175104    1
                   66432     175340       1157.76        87648    2
                   33292      88472        584.16        43968    4
                   16820      45314        299.22        22080    8
                    8504      23624        156.00        11136   16
                    4400      13217         87.29         5664   32

6.2.  Communication  time. 

As explained before, the communication overhead
of the CG-QSC algorithm is due to the matrix-vector multiplication
(neighbour communication) and the inner product
and norm computation (global communication). In
order to verify the theoretically obtained result of Section
5.3, that the communication overhead for computing the
product of A by a vector does not increase with the
number of processors, we attempt to measure the time
spent in communication during the computation of the
product of A by a vector in the following way. We let our
code run, skipping all the computation statements and
executing only the send/receive operations of the iPSC/2
hypercube. For better accuracy we let it carry out
several iterations and then take the average of the time
elapsed. It is our understanding that in this way we
include in our measurements the computation time
required for addressing the message buffers and the overhead
spent in synchronisation and load balancing. Table
6.2 lists the average communication time measured in this
way, for several problem sizes and numbers of processors.

Table 6.2. Communication time in milliseconds on the
iPSC/2 during the computation of the
matrix-vector multiplication for various grid
sizes and processor configurations.

            P
n+1       2       4       8      16      32
 25     0.64    1.28    1.28    1.28    1.28
 49     1.28    2.56    2.88    2.88    2.88
 81     1.28    2.88    2.88    2.88    2.88

From the results of Table 6.2 it is clear that the
communication overhead of our implementation is not
affected by the number of processors, except in the case
of 2 processors, in which there is only one-way communication.
We find these timings quite consistent with our
theoretical statements, taking into account the factors of
clock accuracy and unreliability of the hardware. We
also note that the communication overhead is affected
(not necessarily linearly) by the size of the problem. This
agrees quite well with the communication performance
report for the iPSC/2 hypercube [Inte88], where it is
stated that the time for node-to-node communication is
about the same for messages of 0-100 bytes (up to 25x25
grid), is about double for a message of 104 bytes
compared with one of 100 bytes, and varies slightly for
messages of 104-1024 bytes (this includes the largest
grid size for which the CG-QSC method was tested).

We have carried out similar experiments in order to measure the time
spent in communication during the computation of the inner products
and norms in every iteration of the PCG algorithm. Our experiments
show that this time increases linearly with the dimension of the
hypercube (log(P)), as expected. Based on our experiments, the global
fan-in summation of the partial inner products takes 2-4 milliseconds
for configurations of 2-32 processors. Taking into account that every
PCG iteration requires this type of global communication 3 times and
the neighbour communication for the matrix-vector multiplication once,
we conclude that the time spent in communication is less than 1% of
the total time for the case of a 97x97 grid and 2 processors, while it
is about 16% of the total time for the same grid size and 32
processors. This means that almost all of what is lost in efficiency
is due to communication overhead, and it leads us to suggest (once
again!) that in order to benefit from the use of many processors, we
have to solve problems of appropriately large size.
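
The logarithmic growth of the global summation time can be illustrated with a small simulation. The following is only a sketch in Python, not the authors' iPSC/2 code: the function name `fan_in_sum` and the node numbering are assumptions made for illustration. It pairs nodes across one hypercube dimension per step and counts the steps, which equal log2(P).

```python
def fan_in_sum(partials):
    """Simulate a hypercube fan-in sum of per-node partial inner products.

    partials[r] is the value held by (hypothetical) node r; the number of
    nodes must be a power of two. Returns (total, number_of_steps).
    """
    p = len(partials)
    if p < 1 or p & (p - 1):
        raise ValueError("number of nodes must be a power of two")
    values = list(partials)
    steps = 0
    stride = 1
    while stride < p:
        # Node r + stride sends its value to node r; the two node numbers
        # differ in exactly one bit, so they are hypercube neighbours.
        # All pairs of one step would communicate concurrently.
        for r in range(0, p, 2 * stride):
            values[r] += values[r + stride]
        stride *= 2
        steps += 1
    return values[0], steps


if __name__ == "__main__":
    for p in (2, 4, 8, 16, 32):
        total, steps = fan_in_sum([1.0] * p)
        print(p, total, steps)
```

Each step costs roughly one message startup, so the step count is a simple proxy for the measured fan-in time.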
Abstract

Multigrid is a fast iterative method used to solve linear partial
differential equations. However, because the solution of very small
problems is inherent in the multigrid iteration, it is difficult to
implement efficiently on a massively parallel computer. In this paper,
we present an implementation of the multigrid v-cycle that has
achieved 84% efficiency on the 1,024 processor NCUBE/ten. We also
present a model for the efficiency of multigrid on a parallel computer
that depends only on the efficiency of the smoother at each level.
This model can be used to verify that it is indeed difficult to obtain
extremely high efficiencies (95% to 100%), but that it is relatively
easy to obtain moderately high efficiencies (70% to 85%).

Introduction 

Multigrid methods are popular iterative methods for solving partial
differential equations (PDEs) numerically. These methods make use of
multiple grids of unknowns to reduce the dependence of the number of
iterations required for convergence on the problem size, in contrast
to iterative techniques such as Jacobi, Gauss-Seidel and finite
precision conjugate gradient iterations. Also, unlike other fast
elliptic solvers, multigrid methods are applicable to a wide range of
problems, although their implementation becomes more difficult for
irregular domains or irregular grids.

Because of its usefulness as an iterative solver for PDEs, there have
been many attempts to implement multigrid efficiently on parallel
computers [1,3,4,7]. These have generally been carried out for shared
memory computers with 4 to 16 processors and for distributed memory
machines with 16 to 256 processors. The best of these implementations
have achieved overall efficiencies between 75% and 85% when solving
the largest problem possible on their computer. Higher efficiencies
are very difficult to attain because of the serial nature of standard
multigrid algorithms and the small number of unknowns on coarse grids.

Several variants of the standard multigrid algorithm for parallel
computers have also been developed. These include algorithms based on
multiple coarse grids [5], algorithms based on simultaneous smoothing
on several grids [6], and algorithms for residual splitting to allow
the simultaneous reduction of different frequency errors [3]. These
variants are not always effective, as the increased efficiency is
offset by increased computational requirements, communication
requirements and program complexity.

In this paper, we present an implementation of multigrid for the
NCUBE/ten that achieves 84% efficiency when using 1,024 processors. We
also present a model of the efficiency of multigrid algorithms that
distinguishes between the efficiency of multigrid and the efficiency
of the smoother. Finally, we compare our implementation of multigrid
on the NCUBE/ten to the predictions of the model.

Implementation 

Our multigrid implementation is the v-cycle, in which the iteration
begins on the finest grid, progresses sequentially to the coarsest
grid, and then returns to the fine grid (Figure 1).

Figure 1. The multigrid v-cycle with five levels

In a parallel implementation of multigrid, processors can be idle on
the coarsest levels while the computations proceed at a small number
of points. This problem is particularly severe on massively parallel
machines, such as the NCUBE/ten and the Connection Machine. While this
problem cannot be eliminated for the v-cycle, its effects can be
minimized by using a process known as agglomeration [7]. Whenever a
processor is responsible for a computational domain with too few
unknowns, the domain is combined with that of a neighboring processor.
If the fine-grid computations are load balanced, many of these
combinations can be done in parallel. On a hypercube, this is
equivalent to having half of the cube duplicate the computations of
the other half. Because each half of the hypercube is also a
hypercube, and because the processors that are combining work are
connected, nearest neighbor communications are maintained. While
agglomeration improves the efficiency at a given level only slightly,
it significantly reduces the communication required for the transfer
between levels.

The optimum number of unknowns at which agglomeration will occur is a
function of the machine architecture. On the NCUBE/ten, we found that
two processors should be combined when they are each responsible for
two unknowns along any dimension of the problem. The model developed
in the next section can be used to verify this and to determine the
exact dependence on the hardware parameters.

In our implementation of multigrid, we use a red-black Gauss-Seidel
smoother on each level. While this smoother is not as easily
parallelized as a Jacobi smoother, the improvement in the multigrid
convergence rate is sufficient to justify its use. On the other hand,
the use of an SOR smoother results in a comparable convergence rate,
but does not parallelize easily.

To transfer information between the grids, we use full weighting
restriction and bilinear correction (prolongation) operators [2]. The
full weighting restriction operator requires one more communication
step than injection, which is another commonly used restriction
operator. However, because the full weighting operator is the
transpose of the bilinear correction operator, convergence is
guaranteed for a wide range of problems.


Numerical Results


The multigrid algorithm described in the previous section was
implemented on the NCUBE/ten and tested by solving the following
problem:

\[ -\frac{\partial^2 u}{\partial x^2} - \epsilon \frac{\partial^2 u}{\partial y^2} = (1 + \epsilon)\pi^2 \sin(\pi x)\sin(\pi y), \quad (x,y) \in (0,1) \times (0,1), \]

with the boundary conditions

\[ u(x,0) = u(x,1) = 0, \quad x \in [0,1], \]

\[ u(0,y) = u(1,y) = 0, \quad y \in [0,1]. \]


We discretize this equation by setting Δx = Δy = 1/n, n > 0, and
replacing the partial derivatives with second-order finite
differences. The mesh is distributed over a p x p grid of processors
by assigning an (n/p) x (n/p) mesh to each processor. For each
run of the program, we set ε = 0.1, smooth once on
each  level,  and  consider  the  iterations  to  have  con¬ 
verged  when  the  relative  residual  is  less  than  10~®. 

As a measure of the performance of the algorithm, we use efficiency
and scaled efficiency. If T(n,p) is the time per iteration for the
v-cycle with an n x n mesh on a p x p grid of processors, then the
efficiency is defined to be

\[ e(n,p) = \frac{T(n,1)}{p^2 \, T(n,p)}, \]

which is equivalent to the speedup divided by the number of
processors. In our implementation, a 64 x 64 grid of unknowns is the
largest allowed on one processor. Hence, for n > 64 we define the
scaled efficiency

\[ se(n,p) = \frac{T(n/p,1)}{T(n,p)}, \]

which is equivalent to the scaled speedup [8] divided by the number of
processors. We note that because the serial run time for one v-cycle
is proportional to the number of unknowns, we have
$e(n,p) \approx se(n,p)$. Efficiencies and scaled efficiencies for the
model problem are shown in Table 1.
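
The two definitions can be written down directly. The following Python sketch is illustrative only: the function names, the constant `c` and the timing values are assumptions, not measurements from the NCUBE/ten. It shows that when the serial time is proportional to the number of unknowns, the efficiency and the scaled efficiency coincide.

```python
def efficiency(t_serial, t_parallel, p):
    """e(n, p) = T(n, 1) / (p^2 * T(n, p)) for a p x p grid of processors."""
    return t_serial / (p * p * t_parallel)


def scaled_efficiency(t_serial_local, t_parallel):
    """se(n, p) = T(n/p, 1) / T(n, p)."""
    return t_serial_local / t_parallel


def serial_time(n, c=1.0e-5):
    """Hypothetical serial time per v-cycle, proportional to the n*n unknowns."""
    return c * n * n


if __name__ == "__main__":
    n, p = 128, 4
    t_parallel = 0.05  # made-up parallel time per v-cycle, in seconds
    print(efficiency(serial_time(n), t_parallel, p))
    print(scaled_efficiency(serial_time(n // p), t_parallel))
```

The scaled form needs only a serial run on the (n/p) x (n/p) subproblem, which is why it remains usable when the full problem no longer fits on one processor.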
We see in Table 1 that the degradation in performance due to idle
processors as n/p decreases is quite severe. For the cases n = 4 and
n = 8, we have efficiencies less than $1/p^2$, which corresponds to a
speedup of less than one. We also note that by scaling the problem
with the number of processors, we obtain relatively good efficiencies,
even on 1,024 processors. We conclude that the multigrid v-cycle has a
relatively high serial content, and that scaling the problem size with
the number of processors is more important than with more
parallelizable algorithms such as Jacobi relaxation.


Model  of  Multigrid  Efficiency 

In this section, we develop an expression for the efficiency of the
multigrid v-cycle in terms of the efficiency of the smoother used at
each level. Although the model is developed only for the v-cycle, the
techniques are applicable to any cycle [9].

We suppose that we are approximating the solution to a d-dimensional
PDE at $n^d$ evenly spaced mesh points distributed among $p^d$
processors. We use a k-level, $k < \log_2 n$, multigrid algorithm,
where level 1 is the coarsest grid, and level k is the finest grid.
For $i = 1, 2, \ldots, k$, we let $e_i(n,p)$, $o_i(n,p)$ and $c_i(n)$
denote the efficiency, the parallel overhead and the computational
work for the i-level multigrid algorithm. We note that the
computational work $c_i(n)$ does not depend on the number of
processors p. Similarly, we let $\Delta e_i(n,p)$, $\Delta o_i(n,p)$
and $\Delta c_i(n)$ denote the corresponding quantities for the
smoother at level i. Our goal is to develop an expression for
$e_i(n,p)$ in terms of $e_{i-1}(n/2,p)$ and $\Delta e_i(n,p)$.

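The definition of efficiency is

\[ e_i(n,p) = \frac{c_i(n)}{o_i(n,p) + c_i(n)}. \]

Because

\[ c_i(n) = c_{i-1}(n/2) + \Delta c_i(n), \]

\[ o_i(n,p) = o_{i-1}(n/2,p) + \Delta o_i(n,p), \]

we can write the efficiency as

\[ e_i(n,p) = \frac{c_{i-1}(n/2) + \Delta c_i(n)}{o_{i-1}(n/2,p) + \Delta o_i(n,p) + c_{i-1}(n/2) + \Delta c_i(n)}. \]

Eliminating $o_{i-1}(n/2,p)$ and $\Delta o_i(n,p)$ from this equation
yields

\[ e_i(n,p) = \frac{e_{i-1}(n/2,p) \, \Delta e_i(n,p) \left( 1 + \frac{\Delta c_i(n)}{c_{i-1}(n/2)} \right)}{\Delta e_i(n,p) + e_{i-1}(n/2,p) \, \frac{\Delta c_i(n)}{c_{i-1}(n/2)}}. \qquad (1) \]

The ratio $\Delta c_i(n)/c_{i-1}(n/2)$ depends only on the
computational work in the serial algorithm and can be approximated.
The work required by the smoother on level j can be written in terms
of the work on level i as follows:

\[ \frac{\Delta c_j(n/2^{i-j})}{\Delta c_i(n)} = \left( \frac{1}{2^d} \right)^{i-j}, \quad j = 1, \ldots, i. \]

Now

\[ c_{i-1}(n/2) = \sum_{j=1}^{i-1} \Delta c_j(n/2^{i-j}). \]

Thus, for a v-cycle, we have

\[ \frac{\Delta c_i(n)}{c_{i-1}(n/2)} \approx 2^d - 1. \]

Substituting this relationship into (1) yields the recurrence

\[ e_i(n,p) \approx \frac{2^d \, e_{i-1}(n/2,p) \, \Delta e_i(n,p)}{\Delta e_i(n,p) + (2^d - 1) \, e_{i-1}(n/2,p)}. \qquad (2) \]

The initial condition,

\[ e_1(2^{1-k} n, p) = \Delta e_1(2^{1-k} n, p), \]

where k is the number of levels, simply states that the efficiency of
multigrid is the same as the efficiency of the smoother when only one
level is used.

To actually predict the efficiencies of a multigrid v-cycle, we need
to be able to predict the efficiency of the smoother,
$\Delta e_i(n,p)$. For red-black Gauss-Seidel, we use

\[ \Delta e_i(n,p) = \frac{C_1 (n/p)^2}{C_1 (n/p)^2 + C_2 (n/p) + C_3 \log_2 p^2 + C_4}, \qquad (3) \]

with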


Ci  =2.13x  10-‘'±7.2x  10-® 
C2  =  4.23x  10-'‘±4.8x  10"® 
C3  =  8.23x  10-‘'±1.2x  10"® 
C4  =  9.73x  10-3  ±  9.5  X  lO"®. 


This equation is based on operation counts and a least squares fit to
timing data for the NCUBE/ten [9]. Combining (2) and (3) yields the
predictions of multigrid efficiency shown in Table 2.

Figure 2. The efficiency of an i-level multigrid v-cycle as a function
of the efficiency of an (i - 2)-level multigrid v-cycle. If the
efficiency of the smoother on the finest level is greater than 95% and
on the second finest level is greater than 90%, then the i-level
multigrid efficiency must lie between the two curves.

We  see  from  the  multigrid  efficiency  model  (2) 
that  the  coefficient  of  the  fine-grid  efficiency  is  much 
larger  than  that  of  the  coarse-grid  efficiency.  We 
conclude  that  the  efficiency  of  the  fine-grid  smoother 
influences  the  efficiency  of  multigrid  more  strongly 
than  the  coarse-grid  smoothing.  Thus,  if  an  efficient 
smoother,  such  as  Jacobi  or  red-black  Gauss-Seidel, 
is  used,  moderately  high  efficiencies  for  multigrid 
can  be  achieved  easily.  Figure  2  demonstrates  this. 

We  also  note  from  (2)  that  as  the  dimension,  d, 
increases,  the  weighting  of  the  fine-grid  efficiency 
increases.  Thus,  even  though  the  number  of  points 
per  processor  on  coarse  grids  decreases  more  rapidly 
in  higher  dimensional  problems,  it  should  be  possi¬ 
ble  to  maintain  good  efficiencies  for  multigrid  algo¬ 
rithms  in  higher  dimensions. 
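
The weighting argument can be checked numerically. The sketch below is written in Python and rests on stated assumptions: it takes the serial work ratio between the finest smoother and all coarser levels to be about 2^d - 1, so that the reciprocal of the i-level efficiency is a weighted mean of the reciprocals of the fine-grid smoother efficiency (weight 1 - 2^-d) and the (i-1)-level efficiency (weight 2^-d). The function name and the sample smoother efficiencies are hypothetical, not values from the paper.

```python
def multigrid_efficiency(smoother_eff, d=2):
    """Apply e_i = 2^d * e_{i-1} * De_i / (De_i + (2^d - 1) * e_{i-1}).

    smoother_eff lists smoother efficiencies from the coarsest level
    (index 0) to the finest. With one level, the multigrid efficiency
    equals the smoother efficiency.
    """
    w = 2 ** d
    e = smoother_eff[0]
    for de in smoother_eff[1:]:
        e = w * e * de / (de + (w - 1) * e)
    return e


if __name__ == "__main__":
    # Made-up smoother efficiencies, poor on coarse grids, good on fine ones.
    levels = [0.05, 0.20, 0.50, 0.90, 0.95]
    for d in (1, 2, 3):
        print(d, round(multigrid_efficiency(levels, d), 3))
```

With these sample values the result moves closer to the fine-grid smoother efficiency as d grows, which matches the remark about higher dimensional problems.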

Finally,  we  can  verify  that  agglomeration  should 
occur  when  a  processor  is  responsible  for  two  un¬ 
knowns  along  any  dimension.  In  particular,  for  the 
two  dimensional  problem,  agglomeration  should  oc¬ 
cur  at  level  i  when 

Ae.(n,p)  <  ^Ae<+i(n,^). 

That  is,  agglomeration  should  occur  when  the  loss  of 
efficiency  due  to  communication  is  more  than  that 
due  to  idle  processors.  Substituting  the  efficiency 
of  the  red-black  Gauss-Seidel  smoother  (3)  into  this 


inequality yields

\[ 3 C_1 \left( \frac{n}{p} \right)^2 + C_2 \left( \frac{n}{p} \right) - 4 C_3 < 0. \]

Solving, we find that agglomeration should occur when
$-2.62 < n/p < 1.96 \approx 2$.
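
The quoted interval can be reproduced by solving the quadratic in n/p. In the Python sketch below, the mantissas 2.13, 4.23 and 8.23 are taken from the fitted coefficients; the common factor 1e-4 is an assumption, since the exponents are not legible in this copy, and it does not affect the roots because only the ratios of the three coefficients matter.

```python
import math


def agglomeration_bounds(c1, c2, c3):
    """Roots of 3*c1*x^2 + c2*x - 4*c3 = 0, where x = n/p."""
    a, b, c = 3.0 * c1, c2, -4.0 * c3
    root = math.sqrt(b * b - 4.0 * a * c)
    return (-b - root) / (2.0 * a), (-b + root) / (2.0 * a)


if __name__ == "__main__":
    lower, upper = agglomeration_bounds(2.13e-4, 4.23e-4, 8.23e-4)
    print(round(lower, 2), round(upper, 2))
```

Only the positive root is meaningful, which gives the rule of thumb of about two unknowns per processor along a dimension.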

Conclusions 

Our implementation of the multigrid v-cycle on the 1,024-processor
NCUBE/ten achieved 84% efficiency, demonstrating that multigrid
algorithms can be implemented efficiently on massively parallel
computers. We note that the algorithm chosen for parallelization is
one of the most effective serial algorithms for the solution of
partial differential equations. It was not chosen because of any
inherent parallelism.

We  also  developed  a  model  for  the  efficiency  of 
the  multigrid  v-cycle  that  depends  only  on  the  effi¬ 
ciency  of  the  smoother  at  each  level  and  the  dimen¬ 
sion  of  the  problem.  Using  this  model,  we  showed 
that  with  an  efficient  fine-grid  smoother,  relatively 
high  efficiencies  are  relatively  easy  to  obtain.  In  par¬ 
ticular,  multigrid  efficiencies  of  75%  to  95%  should 
be  attainable  on  most  parallel  computers.  We  also 
concluded  that  similar  multigrid  efficiencies  can  be 
attained  for  higher  dimensional  problems. 

References 

[1] A. Brandt. Multigrid solvers on parallel computers. In M. H.
Schultz, editor, Elliptic Problem Solvers, pages 39-84, Academic
Press, New York, 1981.

[2] W. L. Briggs. A Multigrid Tutorial. SIAM, Philadelphia, PA, 1987.

[3] T. F. Chan and R. S. Tuminaro. Design and implementation of
parallel multigrid algorithms. In S. F. McCormick, editor, Multigrid
Methods: Theory, Applications, and Supercomputing, pages 101-115,
Marcel Dekker, Inc., 1988.

[4] T. F. Chan and R. S. Tuminaro. A survey of parallel multigrid
algorithms. In A. K. Noor, editor, Parallel Computations and Their
Impact on Mechanics, AMD-86, pages 155-170, ASME, 1988.

[5] P. O. Frederickson and O. A. McBryan. Parallel Superconvergent
Multigrid. Technical Report CTC87TR12, Cornell Theory Center, 1987.

[6] D. Gannon and J. van Rosendale. On the structure of parallelism in
a highly concurrent PDE solver. J. Par. Dist. Comp., 3:106-135, 1986.

[7] U. Gärtel. Parallel Multigrid Solver for 3D Anisotropic Elliptic
Problems. Technical Report 390, Gesellschaft für Mathematik und
Datenverarbeitung, 1989.

[8] J. L. Gustafson, G. R. Montry, and R. E. Benner. Development of
parallel methods for a 1024-processor hypercube. SIAM J. Sci. Stat.
Comp., 9(4):609-638, 1988.

[9] D. E. Womble and B. C. Young. A Model and Implementation of
Multigrid for Massively Parallel Computers. Technical Report
SAND89-2781, Sandia National Laboratories, 1990.


Table 1. Efficiency and scaled efficiency for the multigrid v-cycle on
the NCUBE/ten when solving an n x n problem on a p x p grid of
processors. Dashes (—) correspond to cases for which no timing data
exists. Efficiencies appear below the line in the table; scaled
efficiencies appear above the line.

  n x n \ p x p   1 x 1   2 x 2   4 x 4   8 x 8   16 x 16   32 x 32
  4 x 4           1.00    0.125   —       —       —         —
  8 x 8           1.00    0.215   0.047   —       —         —
  16 x 16         1.00    0.392   0.104   0.027   —         —
  32 x 32         1.00    0.630   0.244   0.075   0.019     —
  64 x 64         1.00    0.818   0.449   0.200   0.059     0.015
  128 x 128       —       0.915   0.725   0.434             0.049
  256 x 256       —       —       0.869   0.694   0.395     0.146
  512 x 512       —       —       —       0.858   0.668     0.362
  1024 x 1024     —       —       —       —       0.844     0.642
  2048 x 2048     —       —       —       —       —         0.835

Table 2. Predicted efficiencies for the multigrid v-cycle solving an
n x n problem on a p x p grid of processors using the red-black
Gauss-Seidel smoother.

  n x n \ p x p   1 x 1   2 x 2   4 x 4   8 x 8   16 x 16   32 x 32
  8 x 8           1.00    0.235   0.049   0.012   0.003     0.001
  16 x 16         1.00    0.433   0.110   0.029   0.007     0.002
  32 x 32         1.00    0.667   0.254   0.077   0.020     0.005
  64 x 64         1.00    0.840   0.499   0.200   0.057     0.015
  128 x 128       1.00    0.929   0.745   0.435   0.162     0.045
  256 x 256       1.00    0.969   0.893   0.704   0.383     0.134
  512 x 512       1.00    0.986   0.957   0.876   0.664     0.339
  1024 x 1024     1.00    0.994   0.983   0.952   0.859     0.627
  2048 x 2048     1.00    0.997   0.993   0.981   0.947     0.842
Abstract

Taha and Ablowitz derived numerical schemes by methods related to the
inverse scattering transform (IST) for physically important equations
such as the Korteweg-de Vries (KdV) and modified Korteweg-de Vries
(MKdV) equations. Experiments have shown that the IST numerical
schemes compare very favorably with other numerical methods. In this
paper an accurate numerical scheme based on the IST is used to solve
non-integrable higher KdV equations, for instance

\[ u_t + u^4 u_x + u_{xxx} = 0. \]

It has been conjectured that the above equation admits a self-focusing
singularity. The proposed numerical scheme is used to investigate this
phenomenon. The implementation of the IST scheme leads to a huge
periodic banded system of equations to be solved at each time step,
which requires a large amount of computing time if a serial computer
is used. A vector and parallel implementation of the proposed scheme
on an Intel iPSC/2 hypercube is carried out, and the numerical results
are discussed.

1. Introduction

It has been shown that the higher nonlinear Schrödinger (NLS) equation

\[ i q_t + q_{xx} + |q|^{2p} q = 0, \quad p \ge 2 \qquad (1) \]

under certain conditions admits a self-focusing singularity [1], which
means that the solution of Eq. (1) blows up in finite time. This
suggests that the higher nonlinear KdV equation

\[ u_t + u^p u_x + u_{xxx} = 0, \quad p > 3 \qquad (2) \]

has a self-focusing singularity [2].

Recently, there has been a lot of theoretical and numerical research
to investigate this phenomenon (see Bona et al. [3], and the
references therein). Numerical simulations of solutions of Eq. (2)
(see Fornberg & Whitham [4], Bona et al. [5]) confirm that its
solitary-wave solutions are unstable if $p \ge 4$, and in fact, that
neighbouring solutions emanating from smooth initial data appear to
form singularities in finite time. This paper deals with a numerical
investigation of the blow-up for the higher KdV equation

\[ u_t + u^4 u_x + u_{xxx} = 0 \qquad (3) \]

using an accurate numerical scheme based on the IST. The proposed
numerical scheme is based on an IST numerical scheme derived by Taha
and Ablowitz for the KdV and MKdV equations. Experiments have shown
that the IST numerical schemes compare very favorably with other
numerical methods [6,7].

In order for the singularity to be properly resolved, the mesh sizes
in the x and t directions have to be taken very small. Therefore the
implementation of the proposed numerical scheme on a serial computer
requires a large amount of computing time.

In this paper a parallel algorithm for the above scheme is designed
and implemented on an Intel iPSC/2 hypercube, and the numerical
results are discussed.


2.  The  proposed  numerical  scheme 

The proposed numerical scheme, which is based on the IST, for Eq. (3)
is [6]


At 


— (ur-V  -  3  ur*' 

2(Ax)’  * 


0-8186-21 13-3/90/0000/0564$01 .00  O  1990  IEEE 
+  3  u^+i  -  «r+2  +  “r-2  -  3ujr-i 
+  3«r  -  11.-^,  1  -  ;^I(0" 


One  way  to  implement  this  scheme  is  to  solve  a 
periodic  banded  system  of  equations  at  each  time 
step: 


0-3  1 
-10-31 
0-10-31 


+  mt+V  +  “^+2 )  - 


u"  +  u"*'  ■> 

+ +  «r-2);](^^-^)'(4) 


The truncation error of this scheme is
$O((\Delta t)^2) + O((\Delta x)^2)$. This scheme is applied to
Eq. (3) subject to a Gaussian profile of the form


u(x,0)  =  T\e  ^  , 


-10-3  1 
-1  a  -3 
-1  a 


with  T\  =  3,  and  y  =  8  as  an  initial  condition,  and 
periodic  boundary  conditions  on  the  interval  [-40, 
40]  are  imposed. 

3.  A  parallel  Implementation  of  the  proposed 
scheme. 

Eq.  (4)  can  be  written  as 

-Cl'  +  (3  +  eK"^'  -  3u-t' 


where 


+  C+V  =  -B,  . 


,  -  2(Ax)^ 

®  "  A/  ’ 


where a = 3 + ε. The above system can be solved on a hypercube by
using a modified version of an efficient parallel algorithm for banded
systems [8,9].

Another way to implement the scheme on the iPSC/2 system is to use the
sweeping/iteration technique presented in [6]. To explain the
sweeping/iteration technique, we seek an equation of the form

(9) 

which is suitable for computing explicitly by sweeping to the right.
For stability $|a| \le 1$.
Repeated  substitution  of  Eq.  (9)  into  Eq.  (6)  to 
eliminate  C+2  •  “iT+V » “  f®’'*  of  “iT-V 
gives 


««  =  -Oi  +  (3  +  e)Mr  -  3u,"_,  +  u,"-2 


+  (o  -3)C*‘  +  (o*  -  3a  +  3  +  E)fir4' 


+  (a’  -  3a*  +  3a  -  ea  -  1)C4' 


+  +  «."2) 


- (c + Cl  +  C2  v]  (7) 


Requiring  the  m"_V  term  to  drop  out  determines  a 
(uniquely  since  |a  |  ^  1)  as  a  solution  of 
(a  -  1)^  +  M  =  0  (11) 

and leaves for b (at the new time step) a second order difference
equation given by

h,_i  =  (3fl  -  +  ofl,  (12) 

(where Eq. (12) is obtained from Eqs. (10) and (11)), which is
suitable for computing the b's explicitly by sweeping to the left.
This method is well suited for serial computers but not for pipeline
or vector systems due to the recursive nature of Eqs. (9) and (12). To
implement the scheme on the extension board of an Intel iPSC/2
hypercube, the cyclic reduction method [10] is used to solve the
bidiagonal linear system with a nonzero element in the upper
right-hand corner generated from Eq. (9). On the other hand, the
cyclic reduction method for the periodic tridiagonal system generated
from Eq. (12) proved to be unstable. An efficient vector algorithm
such as a modified LU decomposition for tridiagonal systems with
partial pivoting is suggested. It is to be noted that the rest of the
computations of the sweeping technique are well suited for vector
operations. To implement the sweeping technique on a parallel system
such as the hypercube, the systems generated from Eqs. (9) and (12)
should be solved by modified parallel algorithms for bidiagonal and
tridiagonal systems respectively [8,9].

4.  Numerical  Experiments 

The proposed numerical scheme is implemented on the iPSC/2 system. The
system given in Eq. (8) is solved by an iterative SOR parallel
algorithm, and it is found that this method does not converge. Then
the proposed scheme is implemented on the extension vector board of an
Intel iPSC/2 system by using a cyclic reduction method for Eq. (9),
properly vectorizing Eq. (7) in order to calculate the B's (otherwise
it will not vectorize properly and it will give wrong results), and
leaving Eq. (12) unvectorized. My preliminary experiments indicate
that the above algorithm is four times faster than its serial version.
It is to be noted that more work has to be done in order to vectorize
Eq. (12). Also, more work has to be done in order to parallelize the
sweeping algorithm. According to my preliminary experiments the
solution of the higher KdV equation (3) blows up at t = 0.1157 (see
Fig. 1).


Figure 1. Displays the evolution under Eq. (3) of a Gaussian profile
given in Eq. (5) as an initial condition on the interval [-40, 40].
Δx ≈ 0.0391 and Δt = 0.0001. (a) t = 0.0, (b) t = 0.0999, (c)
t = 0.1145, (d) t = 0.1156.
Abstract:

We present a new parallel implementation of explicit time stepping
methods for time dependent equations in one or two spatial dimensions.
The aim is to minimize the number of data transfers, to get faster
algorithms. In one spatial dimension, τ explicit time steps on p
processors using a grid of size n need O(τ n/p) arithmetical
operations and O(τ) startup operations. The triangle method also
requires O(τ n/p) arithmetical operations but only O(τ p/n) startup
operations. In two spatial dimensions, using a grid of size n x n and
given the same algorithm, the startup time of O(τ) operations using
the conventional approach is considerably reduced to O(τ √p / n)
startup operations. All constants regarding the O notation are less
than 5.

Introduction:

The efficiency of parallel numerical algorithms should be maximal
(efficiency = sequential runtime / (parallel runtime × number of
processors)). Therefore the amount of communication should be small.
Communication consists of the startup time and the time to transfer
data. The startup time is the time to build up a connection between
the processors. For real parallel computers this time is very large,
due to software protocol. The following examples illustrate this:

PARSYTEC Super Cluster : 750
INTEL iPSC2 : 2000
SUPRENUM1 : 3000

(startup time measured in multiples of the time needed for 1 floating
point operation; SUPRENUM supports asynchronous transfer of data,
which is more general and harder to implement than synchronous
transfer).

As a model of computation we use p processors interconnected by a
crossbar. A hypercube or a reconfigurable two dimensional mesh would
also suffice. One floating point operation has cost A. A transfer of n
numbers costs S + n/B, where S stands for startup and B for bandwidth.
All other operations have cost 0.
Implementation for one spatial dimension:


Solving the following initial boundary value problem (compare [1])

\[ u_t(t,x) = a \, u_{xx}(t,x), \quad t > 0, \; x \in I = [0,1], \]

\[ u(0,x) = \varphi(x), \quad x \in I, \]

\[ u(t,x) = 0, \quad t > 0, \; x \in \partial I = \{0,1\}, \]

one often gets explicit iterative formulas like

\[ u_{i+1,k} = u_{i,k} + \frac{a \, \Delta t}{\Delta x^2} \left( u_{i,k-1} - 2u_{i,k} + u_{i,k+1} \right), \quad i \ge 0, \; 0 < k < n, \; n = 1/\Delta x, \]

\[ u_{0,k} = \varphi(k \Delta x), \quad u_{i,0} = u_{i,n} = 0. \]

The goal is the computation of $u_{i,k}$, $1 \le i \le \tau$,
$0 \le k \le n$. This will be done by explicit time stepping as the
above formula suggests. During one such step one computes the values
$u_{i+1,k}$, $0 < k < n$, using the results of the last iteration
$u_{i,k}$, $0 \le k \le n$.
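
A serial version of this explicit time stepping can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' code: the function name `explicit_steps` is hypothetical, and the parameter `r` stands for the factor a Δt/Δx² multiplying the second difference, which should stay at or below 1/2 for the scheme to be stable.

```python
import math


def explicit_steps(phi, n, tau, r):
    """Run tau explicit steps of u_new[k] = u[k] + r*(u[k-1] - 2u[k] + u[k+1]).

    phi is the initial profile on [0, 1], n the number of grid intervals
    (dx = 1/n), and r the assumed value of a*dt/dx^2. The boundary values
    u[0] and u[n] are held at zero.
    """
    dx = 1.0 / n
    u = [phi(k * dx) for k in range(n + 1)]
    u[0] = u[n] = 0.0
    for _ in range(tau):
        new = u[:]
        for k in range(1, n):
            new[k] = u[k] + r * (u[k - 1] - 2.0 * u[k] + u[k + 1])
        u = new
    return u


if __name__ == "__main__":
    u = explicit_steps(lambda x: math.sin(math.pi * x), n=16, tau=10, r=0.25)
    print(max(u))
```

Each new value depends only on three values of the previous iteration, which is the data dependence that both parallel variants have to respect.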

Standard method:

Initially one distributes the gridpoints over the processors
(0 .. p-1). During a time step processor x sends $u_{i,nx/p}$ to
processor x-1 and $u_{i,n(x+1)/p-1}$ to processor x+1. After that it
computes the values $u_{i+1,r}$, $nx/p \le r \le n(x+1)/p - 1$. Thus
τ time steps have the cost:

\[ \tau \left( \frac{4n}{p} A + 2S + \frac{2}{B} \right) \]

Triangle  method: 

Starting with the same distribution the $\tau$ iterations are computed
in blocks of $\lfloor n/(2p) \rfloor$ iterations, without changing the
data dependence. The computation of one block consists of three steps.
Suppose that $n/p = 2m + 1$ and processor x updates the values
$u_{i,k}$, $q \le k < q + n/p$.

During the first step processor x computes $u_{i+j,k}$,
$1 \le j \le m$, $q + j \le k < q + n/p - j$. This can also be seen as
building up a triangle over the processor's domain.


[Figure: values computed during the first step (processor x), forming a triangle; horizontal axis: gridpoints k up to q + n/p - 1, vertical axis: iterations i]


During the second step data are exchanged in the following way.
Processor x sends the values $u_{i+j,k}$, $1 \le j \le m-1$,
$q + j \le k \le q + j + 1$, and $u_{i,q}$, $u_{i,q+1}$ to
processor x-1.

During the third step they complete the last m time steps: processor x
computes the values $u_{i+j,k}$, $1 \le j \le m$,
$q + n/p - j \le k < q + n/p + j$.

[Figure: values computed during the third step (processor x), shown together with the values computed during the first step; horizontal axis: gridpoints k up to q + 3m + 1, vertical axis: iterations i]

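For the next m steps processor x updates the values $u_{i+m,k}$,
$q + m \le k < q + 3m + 1$. Thus $\tau$ iterations cost:

$$\tau\,\frac{4n}{p}A + \left\lceil \frac{2\tau p}{n} \right\rceil \left(S + \frac{n}{pB}\right)$$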

The amount of data transferred and of arithmetic performed remains the
same, but the number of startup operations is considerably reduced.
For example, for $p = 32$, $S = 750$, $n = \tau = 10^4$, $A = 1$,
$B = 1/28$ one gets the efficiencies:

$\mathrm{eff}_{conv} = 44.55\%$, $\mathrm{eff}_{triangle} = 95.4\%$
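The two cost formulas of the one dimensional case can be evaluated directly. The following sketch does this in Python; the function names are illustrative, and the value n = tau = 10^4 is an assumption (it is the value for which the formulas reproduce the efficiencies quoted above). Efficiency is taken here as arithmetic cost divided by total cost.

```python
import math

def cost_standard(tau, n, p, A, S, B):
    """Standard method: every time step pays two startups and two numbers."""
    return tau * (4 * n / p * A + 2 * S + 2 / B)

def cost_triangle(tau, n, p, A, S, B):
    """Triangle method: one transfer of n/p numbers per block of iterations."""
    blocks = math.ceil(2 * tau * p / n)
    return tau * 4 * n / p * A + blocks * (S + n / (p * B))

def efficiency(cost, tau, n, p, A):
    """Fraction of the total cost that is spent on arithmetic."""
    return tau * 4 * n / p * A / cost

if __name__ == "__main__":
    p, S, A, B = 32, 750.0, 1.0, 1.0 / 28.0
    n = tau = 10**4  # assumed value, see the lead-in
    for name, cost in (("standard", cost_standard), ("triangle", cost_triangle)):
        eff = efficiency(cost(tau, n, p, A, S, B), tau, n, p, A)
        print(f"{name}: {100 * eff:.2f}%")
```

With these parameters the standard method spends 1556 cost units per step on communication against 1250 on arithmetic, while the triangle method pays only 64 startups in total.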


Implementation for two spatial dimensions:

Now we consider the two dimensional problem (compare [1]):

$$u_t(t,x,y) = a\,\Delta u(t,x,y), \quad t > 0,\ (x,y) \in \Omega \subset \mathbb{R}^2,\ \Omega = [0,1] \times [0,1],$$
$$u(0,x,y) = \varphi(x,y), \quad (x,y) \in \Omega,$$
$$u(t,x,y) = \psi(t,x,y), \quad t > 0,\ (x,y) \in \partial\Omega.$$

Using step sizes $\Delta x$, $\Delta y$ and $\Delta t$, the usual
5 point star to approximate $\Delta u$ and the forward difference
quotient to approximate $u_t$, one gets:

$$U^{r+1}_{i,k} = U^{r}_{i,k} + \frac{a\,\Delta t}{\Delta x\,\Delta y}\left(-4U^{r}_{i,k} + U^{r}_{i+1,k} + U^{r}_{i-1,k} + U^{r}_{i,k+1} + U^{r}_{i,k-1}\right),$$
$$0 < i < n,\ n = 1/\Delta x,\quad 0 < k < m,\ m = 1/\Delta y,\quad r \ge 0,$$
$$U^{0}_{i,k} = \varphi(i\,\Delta x, k\,\Delta y), \quad 0 < i < n,\ 0 < k < m,$$
$$U^{r}_{i,k} = \psi(r\,\Delta t, i\,\Delta x, k\,\Delta y), \quad r \ge 1,\ (i\,\Delta x, k\,\Delta y) \in \partial\Omega.$$

To update a point one needs the values of its 4 direct neighbours as
well as its own.

Standard method:

Conventionally, to assign processors, one covers the area with squares
or bands; if the area is a rectangle with length n and width m
($m \ll n$) it is better to use bands. During each step the processors
exchange their borders and update their whole area. For $\tau$
iterations one gets:

using squares:

$$\tau\left(\frac{6n^2}{p}A + 4S + \frac{4n}{\sqrt{p}\,B}\right) \qquad (1)$$

using bands:

$$\tau\left(\frac{6nm}{p}A + 2S + \frac{2m}{B}\right) \qquad (2)$$

Triangle method:

Dividing the area into bands ($m \ll n$), it is easy to reduce our
method to the one dimensional case. During each round of computation
the processors build up a prism over their domain. They exchange their
borders and during the next round they complete the last iterations
and build up new prisms and so on.

[Figure legend: plane transferred to the neighbour; prism of the last computation step]

Now $\tau$ iterations cost only:

$$\tau\,\frac{6nm}{p}A + 2\left\lceil \frac{\tau p}{n} \right\rceil S + \tau\,\frac{2m}{B}$$

Comparing this with (2), the number of startup operations is reduced
without increasing other operations.

If $m \approx n$ it is better to use squares, but then the method is
more difficult. One needs 3 rounds of computation and 4 of
communication to compute $n/(2\sqrt{p})$ time steps. First each
processor builds up a pyramid over its domain. The height of the
pyramid is $n/(2\sqrt{p})$ time steps. We refer to fig. 1. To carry
on, the borders must be exchanged. Each processor sends the two upper
levels of face F1 of its pyramid to the neighbour in front of it and
the two upper levels of face F2 to its right neighbour. Thus one needs
two transfers with

$$2 \sum_{i=0}^{\lceil n/(2\sqrt{p}) \rceil - 1} (2i + 1) \approx \frac{n^2}{2p}$$

data per transfer. After the communication each processor completes
the last $n/(2\sqrt{p})$ iterations as far as possible. Then the
situation of fig. 2 arises locally:


[fig. 1 and fig. 2; label in the figure: insert the pyramid]

Only the pyramid shown in fig. 3 remains to be computed. The planes F5
and F6 are in the same processor. Thus only two planes must be
transferred. This can be done by two transfers with $n^2/(2p)$ data
each.

[fig. 3; faces: F3 = triangle aed, F4 = triangle dec, F5 = triangle bec, F6 = triangle aeb]

After the transfer the remaining pyramid can be computed. For $\tau$
iterations one needs:

$$\tau\,\frac{6n^2}{p}A + \frac{2\tau\sqrt{p}}{n}\left(4S + \frac{2n^2}{pB}\right)$$

The number of startups is much smaller than in (1). The number of all
other operations remains unchanged. For example, for $p = 32$,
$S = 750$, $\tau = n = m = 10^2$, $A = 1$, $B = 1/28$ one gets the
efficiencies:

$\mathrm{eff}_{conv} = 27.2\%$, $\mathrm{eff}_{triangle} = 44.1\%$.


The new implementation has one disadvantage in the two dimensional
case. Because different iterations are computed together, more data
have to be stored at the same time. The conventional implementation
becomes problematic with regard to efficiency only for small amounts
of data per processor ($n^2/p \approx 10^2$). In this case our method
is more efficient and practicable. With a trick it is possible to use
only 4.5 times more memory than the standard method and a little extra
administration. The pyramids are discrete ones, which are stored in
planes. From plane to plane they lose one ring. With respect to the
algorithm only the two upper levels of each face of the pyramid are
necessary to continue the computation. Therefore it is enough to store
the two outermost rings of each plane. Allocating two matrices of size
$\lceil 3n/(2\sqrt{p}) \rceil \times \lceil 3n/(2\sqrt{p}) \rceil$ per
processor, all data can be stored, as we show now. The planes with
even numbers are stored in the first matrix and the others in the
second one. During the first round of computation each processor works
on its original domain (fig. 4).

[fig. 4: dimension of the matrix; side length 1.5 n/√p, of which n/√p holds the original domain and the remainder is space for the data received]

The data received during transfer are stored in a similar manner in
the remaining parts of both matrices (fig. 5).


[fig. 5 a: data received during the first transfer (space for F1); fig. 5 b: data received during the second transfer]

After $n/(2\sqrt{p})$ iterations are computed, the processor works
over a different part of the matrices (fig. 6).

[fig. 6: the original domain for the next iterations]

During the next block of iterations the processor transfers its
borders in the other direction. Therefore all iterations can be
computed on the two matrices. In the conventional implementation
$n^2/p$ space is necessary to store the domain of a processor and
$4n/\sqrt{p}$ space to store the borders exchanged. The Triangle
Method needs $4.5\,n^2/p$ space per processor, that is no more than
4.5-fold space.


References:

[1] Todd, J. (1962) Survey of Numerical Analysis, McGraw-Hill Book
Company, Inc., New York, pp. 419.
The Fifth Distributed Memory
Computing  Conference 


22:  Concurrent  Simulation  Paradigms 


A  MINI-SYMPOSIUM  ORGANIZED  BY: 
Anthony  Skjellum  and  Manfred  Morari 
California  Institute  of  Technology 
and 

Sven  Mattisson  and  Lena  Peterson 
Lund  Institute  of  Technology 


Concurrent  Simulation  Paradigms  Mini-Symposium 


The Concurrent Simulation Paradigms Mini-symposium addressed the use
of distributed memory computers in the solution of large-scale systems
of ordinary differential and differential-algebraic equations (ODE's
and DAE's). The solution of large-scale systems of parabolic partial
differential equations is also considered in the paper by Vandewalle.
Applications from electrical and chemical engineering are presented
here, specifically the transient response of VLSI circuits and a
dynamic flowsheet simulation used on networks of distillation columns.

The key characteristics shared by these problems are their large-scale
and inhomogeneous nature, sparse connectivity, stiffness, and widely
varying timescales. It is not meaningful to "scale" these problems.


Traditional numerical methods are considered in two of the papers. The
Concurrent DASSL and ESACAP efforts utilize parallelized sequential
numerical analysis. A novel numerical method, Waveform Relaxation, is
considered by four papers and realized, for instance, in the CONCISE
VLSI circuit simulator. Speedups achievable on medium-grain
multicomputers are compared and discussed. The paper of Nevanlinna
centers on more theoretical aspects of Waveform Relaxation for high
performance circuit simulation.

Actual  implementations  of  the  algorithms  described 
here  are  discussed  for  the  iPSC/2  and  Symult  s2010 
multicomputers . 


[Figure: Concurrency Diagram; label in the figure: Best Concurrent Algorithm's Overhead]


The  Concurrency  Diagram  illustrates  the  trade-offs  between  the  “best”  parallelized  sequential  algorithm  and  the  “best” 
concurrent  algorithm.  The  former  has  a  higher  sequential  fraction,  but  lower  overhead  compared  to  the  latter.  The 
“best”  concurrent  algorithm  has  additional  (parallelizable)  overhead,  but  a  smaller  sequential  fraction,  allowing  it  to 
achieve  higher  speedups  when  many  nodes  are  used  (large-resource  limit,  beyond  p"). 
Waveform Relaxation Methods
for  Solving  Parabolic  Partial  Differential  Equations 

Stefan  Vandewalle* 

Department of Computer Science, Katholieke Universiteit Leuven
Celestijnenlaan 200A, B-3030 Heverlee, Belgium


Abstract 

The  numerical  solution  of  a  parabolic  partial 
differential  equation  is  usually  calculated  by  a 
time  stepping  method.  This  precludes  the  efficient 
use  of  vectorization  and  parallelism,  if  the  problem 
to  be  solved  on  each  time  level  is  not  very  large. 
In  this  paper  we  present  an  algorithm  which  over¬ 
comes  the  limitations  of  the  standard  marching 
schemes  by  solving  for  the  solution  on  all  the  time 
levels  simultaneously.  The  method  is  applicable  to 
linear  and  nonlinear  problems  on  arbitrary 
domains.  It  can  be  used  to  solve  initial-boundary 
value  problems  as  well  as  time-periodic  equations. 

1.  Introduction 

Standard  parabolic  marching  schemes  are  gen¬ 
erally  classified  as  either  explicit,  implicit  or  semi- 
implicit,  depending  on  the  discretization  of  the  time 
derivative.  We  have  compared  the  parallel  charac¬ 
teristics  of  several  classical  techniques  in  [13].  The 
results  can  be  summarized  as  follows. 

•  The explicit methods are highly parallel. Parallel efficiencies
close to optimal can easily be obtained. They suffer however from a
severe stability constraint, which necessitates the use of very small
time steps and makes them less attractive for solving large problems.

•  The implicit methods transform the problem into an elliptic partial
differential equation that has to be solved on each time level. The
multigrid method can be used to solve these equations very rapidly,
see [3]. This method uses a hierarchy of fine and coarse grids. The
fine grid operations can be performed very efficiently on a
multiprocessor. It is much more difficult to parallelize the coarse
grid operations since parallel overheads cannot be neglected, see
e.g. [11].


•  In  the  semi-implicit  methods,  the  problem  of 
solving  one  very  large  system  of  equations  for  each 
time  step  is  reduced  to  the  problem  of  solving 
many  decoupled  tridiagonal  systems.  Various 
parallel  algorithms  are  based  on  substructuring 
and  cyclic  reduction.  Their  arithmetic  complexity 
is  approximately  twice  that  of  the  best  sequential 
algorithm.  This  limits  their  parallel  efficiency. 

Each  of  the  methods  can  be  parallelized 
efficiently  for  problems  that  are  large  enough.  In 
that  case  the  best  sequential  algorithm  is  also  the 
best  parallel  one.  For  (relatively)  small  problems, 
only  the  explicit  methods  retain  their  parallel 
efficiency.  However,  they  are  limited  by  the  stabil¬ 
ity  constraint  and  therefore  not  competitive.  The 
best  standard  methods,  which  are  of  second  order 
implicit  type,  perform  unsatisfactorily.  They 
suffer  from  a  high  communication  complexity  and 
hardly  take  advantage  of  the  available  parallel 
computing  power. 

New  algorithms  are  therefore  needed  for  solv¬ 
ing  parabolic  problems  on  large  scale  parallel 
machines.  These  algorithms  should  either  improve 
the  numerical  quality  of  the  explicit  methods  or 
increase  the  parallel  efficiency  of  the  fast  implicit 
methods.  The  latter  can  be  obtained  by  calculating 
the  solution  on  several  or  all  time  levels  at  once. 
The waveform relaxation technique, to be presented in section 2,
belongs to this class. We will discuss its application for solving
initial-boundary value problems in section 3. In section 4 we will
consider the solution of time-periodic parabolic equations. We have
implemented the method on an Intel hypercube. Some implementation
aspects will be discussed in section 5. In section 6 we will
illustrate the method by two examples and compare its performance to
that of a parallel implementation of the best standard method.


*  Research assistant, National Science Foundation (Belgium)
2.  The waveform relaxation method

$$u(x, t_0) = u_0 \qquad (3.1.c)$$


Waveform relaxation (WR), also called dynamic iteration or
Picard-Lindelöf iteration [8], is a technique for solving large
systems of ordinary differential equations. We will explain the method
by its application to the following system,

$$\frac{d}{dt} y_i = f_i(t, y_1, \ldots, y_N) \qquad (2.1)$$

with $y_i(t_0) = y_{i0}$, $i = 1, \ldots, N$ for $t \in [t_0, t_f]$.

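The Jacobi variant of the WR algorithm can be formulated as follows.


n := 0

choose $y_i^{(0)}(t)$ for $t \in [t_0, t_f]$ and $i = 1, \ldots, N$

repeat
    for each i:

        solve $\frac{d}{dt} y_i^{(n+1)} = f_i(t, y_1^{(n)}, \ldots, y_{i-1}^{(n)}, y_i^{(n+1)}, y_{i+1}^{(n)}, \ldots, y_N^{(n)})$

        with $y_i^{(n+1)}(t_0) = y_{i0}$
    n := n+1

until convergence.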
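The Jacobi iteration above can be sketched for a small linear system y' = A y. The sketch below is illustrative only: the backward Euler integrator, the fixed number of sweeps in place of a convergence test, and all function names are assumptions made for this example. Each scalar equation is integrated over the whole time window using the previous waveforms of the other components.

```python
import numpy as np

def jacobi_wr(A, y0, t0, tf, nt, sweeps):
    """Jacobi waveform relaxation for y' = A y, y(t0) = y0.

    Every scalar ODE is integrated with backward Euler on nt equal steps.
    Returns an array of shape (N, nt + 1) holding one waveform per row.
    """
    A = np.asarray(A, dtype=float)
    y0 = np.asarray(y0, dtype=float)
    dt = (tf - t0) / nt
    # Initial waveforms: constant in time, equal to the initial values.
    Y = np.tile(y0[:, None], (1, nt + 1))
    for _ in range(sweeps):
        Y_new = np.empty_like(Y)
        Y_new[:, 0] = y0
        for i in range(len(y0)):
            # Contribution of the other components, taken from the old iterate.
            coupling = A[i] @ Y - A[i, i] * Y[i]
            for k in range(nt):
                # (y_{k+1} - y_k) / dt = a_ii * y_{k+1} + coupling_{k+1}
                Y_new[i, k + 1] = (Y_new[i, k] + dt * coupling[k + 1]) / (1.0 - dt * A[i, i])
        Y = Y_new
    return Y

def coupled_backward_euler(A, y0, t0, tf, nt):
    """Reference: backward Euler applied to the fully coupled system."""
    A = np.asarray(A, dtype=float)
    dt = (tf - t0) / nt
    M = np.eye(len(y0)) - dt * A
    Y = np.empty((len(y0), nt + 1))
    Y[:, 0] = y0
    for k in range(nt):
        Y[:, k + 1] = np.linalg.solve(M, Y[:, k])
    return Y
```

The fixed point of the sweep is the coupled backward Euler solution, so comparing the two functions gives a simple convergence check. The loop over i is independent across components, which is what makes the Jacobi variant attractive on a parallel machine.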


The  adaptation  of  the  algorithm  to  obtain  a 
Gauss-Seidel  or  SOR  type  iteration  is  straightfor¬ 
ward.  In  the  iteration  step  each  differential  equa¬ 
tion  is  solved  as  an  equation  in  one  unknown.  As 
such  the  method  is  very  similar  to  the  iterative 
techniques  for  solving  algebraic  systems. 

The  theoretical  foundations  of  the  WR  method 
have  been  discussed  in  a  number  of  papers.  In  [14] 
convergence  is  proven  for  nonlinear  systems.  The 
authors  concentrate  on  the  systems  of  ordinary 
differential  equations  that  arise  in  the  problem  of 
simulating  VLSI  devices.  For  these  systems  the 
method has been shown to be very effective. An
analysis  for  linear  systems  is  given  by  Miekkala 
and  Nevanlinna  in  [8].  Further  convergence  results 
are  given  in  [5],  in  which  the  relation  is  established 
between  the  number  of  iterations  and  the  accuracy 
order  of  a  partially  converged  solution. 

3.  Initial boundary value problems

3.1  Standard  waveform  relaxation 

We consider the following parabolic equation

$$\frac{\partial u}{\partial t} + \mathcal{L}(u) = f_1, \quad (x,t) \in \Omega \times [t_0, t_f] \qquad (3.1.a)$$

where $\Omega \subset \mathbb{R}^n$, $\mathcal{L}$ is an elliptic,
possibly nonlinear operator and B is the boundary operator. After
spatial discretization and incorporation of the boundary conditions,
the parabolic problem is transformed into a system of ordinary
differential equations with one equation at each grid point,

$$\frac{dU}{dt} + L(U) = F, \quad U(t_0) = U_0. \qquad (3.2)$$

U is the vector of unknown functions defined at the grid points, L is
the operator derived from $\mathcal{L}$ by discretization and F is the
vector of functions determined by $f_1$ and $f_2$.

The standard WR algorithm may be applied to solve (3.2). For instance,
in the case of a five-point finite difference discretization of the
heat equation, and with use of the Jacobi algorithm, the equation to
be solved at each grid point $(x_i, y_j)$ is written as

$$\frac{d}{dt}U_{ij}^{(n+1)} + \frac{4}{(\Delta x)^2}U_{ij}^{(n+1)} = \frac{1}{(\Delta x)^2}\left(U_{i+1,j}^{(n)} + U_{i-1,j}^{(n)} + U_{i,j+1}^{(n)} + U_{i,j-1}^{(n)}\right) + f_{ij}.$$

This is a simple first order differential equation which can be solved
using any standard, stiff ODE integrator.

Attempts to use WR in the way described above to solve parabolic
problems have not been very successful. This is due to the slow
convergence of the method. Indeed, as was shown in [8], the
convergence rates for the Jacobi and Gauss-Seidel scheme are of order
$1 - O(h^2)$, where h is the mesh size parameter. In contrast to the
case of a linear system of equations arising from a discretized
elliptic equation, overrelaxation in a SOR fashion does not lead to
significantly improved convergence characteristics.


3.2  Multigrid Waveform Relaxation


3.2.1  General Idea. The convergence can be accelerated if WR is
combined with the multigrid idea [7,10,12]. For a description of the
standard multigrid method we refer to [3]. The method differs from the
other iterative techniques in that it uses a set of nested grids, with
the finest one corresponding to the one on which the solution is
desired. Its superior convergence characteristics are based on the
interplay of fine grid smoothing, which annihilates high frequency
errors, and coarse grid correction, which is applied to reduce the low
frequency errors.


$$B(u) = f_2, \quad (x,t) \in \partial\Omega \times [t_0, t_f] \qquad (3.1.b)$$


The  method  is  extended  to  time  dependent 
problems  in  the  following  way.  Each  of  the 


multigrid  operations  is  adapted  to  operate  on  the 
entire  functions  Uij(t)  instead  of  on  single  scalar 
values. 

•  The  smoothing  is  performed  by  applying  one  or 
more  Gauss-Seidel  or  damped  Jacobi  waveform 
relaxations.  Smoothing  rates  for  these  relaxations 
have  been  given  in  [7]. 

•  The defect of an approximation U is defined as

$$D = \frac{d}{dt}U + L(U) - F. \qquad (3.3)$$

The calculation of the derivative in the computation of the defect can
be avoided. The application of a standard WR step to an approximation
$U^{(n)}$, resulting in an improved approximation $U^{(n+1)}$,
corresponds to a calculation of the following type,

$$\frac{d}{dt}U^{(n+1)} + N\,U^{(n+1)} = M\,U^{(n)} + F, \qquad (3.4)$$

where N and M satisfy $L = N - M$. The defect,

$$D = \frac{d}{dt}U^{(n+1)} + L\,U^{(n+1)} - F, \qquad (3.5)$$

can then be calculated easily, by eliminating the derivative with
(3.4), as

$$D = M\,(U^{(n)} - U^{(n+1)}). \qquad (3.6)$$

•  The restriction and prolongation are calculated using identical
formulae as in the elliptic case. However these formulae now operate
on functions instead of on single values. As an example, we formulate
the two-dimensional WR full-weighting restriction operator, in stencil
notation,

$$\bar{U}_{ij}(t) = \frac{1}{16}\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix} U_{ij}(t), \qquad (3.7)$$

where $\bar{U}_{ij}(t)$ and $U_{ij}(t)$ are corresponding grid point
functions on the coarse and the fine grid.
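The full-weighting restriction of waveforms can be sketched as an array operation in which the time axis simply rides along. The following Python sketch assumes a square fine grid with an odd number of points per side (boundary included), waveforms stored as an array of shape (nf, nf, nt), and plain injection for boundary points; the function name and storage layout are illustrative choices, not taken from the paper.

```python
import numpy as np

def restrict_full_weighting(U):
    """Full-weighting restriction applied to whole waveforms.

    U has shape (nf, nf, nt) with nf odd; the last axis holds the time levels.
    Coarse point (I, J) coincides with fine point (2I, 2J).
    """
    U = np.asarray(U, dtype=float)
    nf = U.shape[0]
    if U.shape[1] != nf or nf % 2 == 0:
        raise ValueError("expected a square grid with an odd number of points")
    # Injection sets the boundary waveforms; interior values are overwritten below.
    Uc = U[::2, ::2].copy()
    lo, mid, hi = slice(1, -2, 2), slice(2, -1, 2), slice(3, None, 2)
    Uc[1:-1, 1:-1] = (
        4.0 * U[mid, mid]
        + 2.0 * (U[lo, mid] + U[hi, mid] + U[mid, lo] + U[mid, hi])
        + U[lo, lo] + U[lo, hi] + U[hi, lo] + U[hi, hi]
    ) / 16.0
    return Uc
```

Each term of the stencil is an operation on complete vectors of time values, so the work per grid point grows with the vector length while the number of array operations stays fixed.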


We will first state the equivalent algorithm of the multigrid
correction scheme, which is used for solving linear problems. The
equivalent of the full approximation scheme for solving nonlinear
problems will be given afterwards.


3.2.2  Multigrid correction scheme. Let $G_i$, $i = 0, 1, \ldots, k$
be the hierarchy of grids with $G_k$ the finest grid and $G_0$ the
coarsest grid. Equation (3.2) or equivalently,

$$\frac{dU_k}{dt} + L_k U_k = F_k, \quad U_k(t_0) = U_{k0} \qquad (3.8)$$

is solved by iteratively applying the following algorithm to an
initial approximation of $U_k$.

procedure mgm(k, F_k, U_k)

if (k = 0): solve $\frac{d}{dt}U_0 + L_0 U_0 = F_0$ exactly

else

    - perform $\nu_1$ smoothing operations

    - compute the defect: $D_k := \frac{d}{dt}U_k + L_k U_k - F_k$

    - project the defect on $G_{k-1}$: $F_{k-1} := I_k^{k-1} D_k$

    - solve on $G_{k-1}$: $\frac{d}{dt}U_{k-1} + L_{k-1} U_{k-1} = F_{k-1}$

      repeat $\gamma_k$ times mgm(k-1, F_{k-1}, U_{k-1}),
      starting with $U_{k-1} := 0$.

    - interpolate the correction to $G_k$ and correct $U_k$:

      $U_k := U_k - I_{k-1}^{k} U_{k-1}$

    - perform $\nu_2$ smoothing operations
endif


The algorithm is completely defined by specifying the grid sequence
$G_i$, $i = 0, \ldots, k$, the discretized operators $L_i$, the
inter-grid transfer operations $I_{i-1}^{i}$ and $I_i^{i-1}$, the
nature of the smoothing relaxations, and by assigning a value to the
constants $\nu_1$, $\nu_2$ and $\gamma_i$. So-called V- and
W-multigrid-cycles are obtained with the values 1 and 2 for
$\gamma_i$. Another choice leads to the F-cycle. The algorithm can be
combined with the idea of nested iteration. The initial approximation
to the problem on $G_i$ is then derived from the solution obtained on
$G_{i-1}$. This leads to the waveform equivalent of the full multigrid
method.
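The correction scheme can be sketched on a small model problem. The sketch below is not the implementation discussed in this paper; it assumes the one dimensional heat equation u_t = u_xx + f with zero Dirichlet boundaries, backward Euler in time, Gauss-Seidel waveform relaxation as smoother, full weighting and linear interpolation in space, and a grid of 2^k - 1 interior points. All names are illustrative. Waveforms are stored row-wise in arrays of shape (n, nt + 1), with column 0 holding the initial values.

```python
import numpy as np

def apply_L(U, h):
    """Discrete -d^2/dx^2 with zero Dirichlet boundaries, at every time level."""
    P = np.pad(U, ((1, 1), (0, 0)))
    return (2.0 * P[1:-1] - P[:-2] - P[2:]) / h**2

def smooth(U, F, h, dt, sweeps):
    """Gauss-Seidel waveform relaxation; each scalar ODE uses backward Euler."""
    n, nt1 = U.shape
    diag = 1.0 / dt + 2.0 / h**2
    for _ in range(sweeps):
        for i in range(n):
            left = U[i - 1] if i > 0 else np.zeros(nt1)
            right = U[i + 1] if i < n - 1 else np.zeros(nt1)
            rhs = (left + right) / h**2 + F[i]
            for j in range(1, nt1):
                U[i, j] = (U[i, j - 1] / dt + rhs[j]) / diag
    return U

def defect(U, F, h, dt):
    """Time-discrete defect dU/dt + L U - F; zero at the initial time level."""
    D = np.zeros_like(U)
    D[:, 1:] = (U[:, 1:] - U[:, :-1]) / dt + apply_L(U, h)[:, 1:] - F[:, 1:]
    return D

def restrict(D):
    """Full weighting in space, applied to whole waveforms."""
    return 0.25 * D[0:-2:2] + 0.5 * D[1:-1:2] + 0.25 * D[2::2]

def prolong(V, n_fine):
    """Linear interpolation in space, applied to whole waveforms."""
    P = np.pad(V, ((1, 1), (0, 0)))
    U = np.zeros((n_fine, V.shape[1]))
    U[1::2] = V
    U[0::2] = 0.5 * (P[:-1] + P[1:])
    return U

def mgm(U, F, h, dt, nu1=1, nu2=1, gamma=1):
    """One waveform multigrid cycle (gamma=1: V-cycle, gamma=2: W-cycle)."""
    n = U.shape[0]
    if n == 1:
        return smooth(U, F, h, dt, 1)  # a single point is solved exactly
    U = smooth(U, F, h, dt, nu1)
    Fc = restrict(defect(U, F, h, dt))
    V = np.zeros_like(Fc)
    for _ in range(gamma):
        V = mgm(V, Fc, 2.0 * h, dt, nu1, nu2, gamma)
    U -= prolong(V, n)
    return smooth(U, F, h, dt, nu2)
```

Repeated calls to `mgm` on an initial guess whose first column holds the initial condition should approach the solution of the coupled backward Euler system. Every grid operation acts on complete waveforms, which is the only change with respect to an elliptic multigrid cycle.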

3.2.3  Full  approximation  scheme.  The  algorithm 
was  extended  to  nonlinear  parabolic  problems  in 
[10].  The  nonlinear  algorithm  is  easily  derived 
from  the  well-known  multigrid  full  approxima¬ 
tion  scheme  and  is  presented  at  the  end  of  this  sec¬ 
tion. 

The derivative calculation in the determination of the coarse grid
problem right hand side can be avoided. Indeed, when the two
restriction operators $I_k^{k-1}$ and $\bar{I}_k^{k-1}$ are equal, the
two derivatives cancel. The right hand side of the problem on the
coarse grid may then be calculated by the following formula,

$$F_{k-1} := L_{k-1}(\bar{U}_{k-1}) - I_k^{k-1}\left(L_k(U_k) - F_k\right). \qquad (3.9)$$

procedure fas(k, F_k, U_k)

if (k = 0): solve $\frac{d}{dt}U_0 + L_0(U_0) = F_0$ exactly

else

    - perform $\nu_1$ smoothing operations

    - project $U_k$ onto $G_{k-1}$: $\bar{U}_{k-1} := \bar{I}_k^{k-1} U_k$

    - calculate the coarse problem right hand side:

      $F_{k-1} := \frac{d}{dt}\bar{U}_{k-1} + L_{k-1}(\bar{U}_{k-1}) - I_k^{k-1}\left(\frac{d}{dt}U_k + L_k(U_k) - F_k\right)$

    - solve on $G_{k-1}$: $\frac{d}{dt}U_{k-1} + L_{k-1}(U_{k-1}) = F_{k-1}$

      repeat $\gamma_k$ times fas(k-1, F_{k-1}, U_{k-1}),
      starting with $U_{k-1} := \bar{U}_{k-1}$.

    - interpolate the correction to $G_k$ and correct $U_k$:

      $U_k := U_k + I_{k-1}^{k}\,(U_{k-1} - \bar{U}_{k-1})$

    - perform $\nu_2$ smoothing operations

endif


4.  Time-periodic  parabolic  problems 

4.1  The  standard  algorithms 

In this section we will consider the parabolic problem (3.1.a-c) where
the initial condition (3.1.c) is replaced by the periodicity condition

$$u(x, t_0) = u(x, t_f). \qquad (4.1)$$

This  problem  is  of  considerable  importance  in  vari¬ 
ous  areas  of  practical  interest,  such  as  wing  flutter, 
ferro-conductor  eddy  currents,  chemical  reactor 
theory,  pulsating  stars,  and  fluid  dynamics.  Vari¬ 
ous  algorithms  have  been  proposed  to  compute  the 
stable  periodic  solutions.  One  approach  is  a  time- 
integration  of  the  studied  system,  starting  from  an 
arbitrary  initial  condition,  until  a  stable  periodic 
orbit  is  reached.  This  brute  force  method  may  how¬ 
ever  be  prohibitively  expensive  in  the  case  of 
slowly  decaying  transients.  A  second  approach 
consists  of  using  difference  methods  where  a  large 
system  of  nonlinear  algebraic  equations  is  obtained 
after  discretization.  This  system  may  be  solved 
with direct or iterative sparse solvers [9]. A third and commonly used
approach is based on the shooting method [4]. Finally, a very fast
algorithm was presented by Wolfgang Hackbusch in [2], in which the
periodic problem is reformulated as an integral equation and solved by
the multigrid method of the second kind. We will briefly review this
algorithm. In section 6 it will serve as the reference against which a
new, WR based algorithm is compared. We will restrict our attention to
the linear case as the nonlinear algorithm is very similar.

4.2  Multigrid  method  of  the  second  kind 

The solution of the linear initial-boundary value problem (3.1.a-c),
restricted to $t_f$, can be written as the outcome of an affine
mapping applied to the initial condition $u_0$,

$$u(x, t_f) = A\,u_0 + f. \qquad (4.2)$$

A is a linear integral operator, such that $A u_0$ equals
$u(x, t_f)$, the solution to (3.1.a-c) with homogeneous right hand
sides ($f_1 = 0$ and $f_2 = 0$), while $f(x)$ equals $u(x, t_f)$, the
solution to (3.1.a-c) with zero initial condition ($u_0 = 0$). With
this notation, the periodicity condition (4.1) becomes

$$y = A\,y + f, \qquad (4.3)$$

where $y(x)$ is a function on $\Omega$. The determination of a
function y satisfying (4.3) is equivalent to the problem of finding a
function u that satisfies (3.1.a-b) and (4.1). Indeed, if y fulfills
(4.3), then the solution u of the initial boundary value problem
(3.1.a-c) with $u_0 = y$ is the solution of the time-periodic problem.
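The affine mapping (4.2) and the fixed point equation (4.3) can be illustrated on a semi-discrete problem dU/dt + L U = forcing(t). The sketch below is an assumption-laden illustration: it uses backward Euler in time, a dense matrix L, and it assembles the discrete operator column by column only because the example is tiny. The function names are hypothetical.

```python
import numpy as np

def period_map(y, L, forcing, t0, tf, nt):
    """Integrate dU/dt + L U = forcing(t) from t0 to tf with backward Euler,
    starting from U(t0) = y, and return U(tf)."""
    dt = (tf - t0) / nt
    M = np.eye(len(y)) / dt + L
    U = np.array(y, dtype=float)
    for j in range(1, nt + 1):
        U = np.linalg.solve(M, U / dt + forcing(t0 + j * dt))
    return U

def periodic_initial_state(L, forcing, t0, tf, nt):
    """Solve y = K y + f for the initial state of the periodic solution."""
    n = L.shape[0]
    # f: state at tf for zero initial condition and the given forcing.
    f = period_map(np.zeros(n), L, forcing, t0, tf, nt)
    # Column j of K: state at tf for the j-th unit vector and zero forcing.
    no_forcing = lambda t: np.zeros(n)
    K = np.column_stack([period_map(e, L, no_forcing, t0, tf, nt) for e in np.eye(n)])
    return np.linalg.solve(np.eye(n) - K, f)
```

Starting the time integration from the returned state reproduces that state after one period. Repeatedly applying `period_map` to an arbitrary start vector corresponds to the brute force time integration mentioned in section 4.1.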

(4.3) is a Fredholm integral equation of the second kind and may be
solved by the very fast multigrid method of the second kind. We refer
to [3] for an in depth analysis of this technique and for a discussion
of various applications. In a similar way as in the multigrid method
for elliptic equations, (4.3) is discretized on a set of grids $G_i$,
$i = 0, \ldots, k$, resulting in a set of discrete equations,

$$Y_i = K_i Y_i + F_i \quad \text{on } G_i. \qquad (4.4)$$

The problem on the fine grid is solved by iteratively applying the
following algorithm to an initial approximation of $Y_k$.


procedure mgm_2nd(k, F_k, Y_k)

if (k = 0): solve $Y_0 = K_0 Y_0 + F_0$ exactly

else

    - smoothing: $Y_k := K_k Y_k + F_k$

    - compute the defect: $D_k := Y_k - K_k Y_k - F_k$

    - project the defect on $G_{k-1}$: $F_{k-1} := I_k^{k-1} D_k$

    - solve on $G_{k-1}$: $Y_{k-1} = K_{k-1} Y_{k-1} + F_{k-1}$

      repeat 2 times mgm_2nd(k-1, F_{k-1}, Y_{k-1}),
      starting with $Y_{k-1} := 0$.

    - interpolate the correction to $G_k$ and correct $Y_k$:

      $Y_k := Y_k - I_{k-1}^{k} Y_{k-1}$

endif
No explicit representation of the discretized integral operator $K_i$
is required. Indeed, application of $K_i$ to a function $Y_i$ is
equivalent to calculating the solution of one discrete
initial-boundary value problem defined on $G_i$. $K_i Y_i$ may thus be
computed by using standard parabolic solvers, such as a time-stepping
method, or by using the waveform relaxation algorithm of section 3.
In [3] the convergence rate of the algorithm is shown to be of the
order $O((\Delta x_k)^2)$, where $\Delta x_k$ is the fine grid mesh
size. As such, one iteration step is usually sufficient to solve (4.3)
to discretization accuracy. (To obtain this result, some mild
restrictions on the size of the time increments $\Delta t_i$,
$i = 0, \ldots, k$, have to be taken into account, in order to
guarantee a sufficient smoothing behavior of the time discretization
formula.) It can easily be shown that the arithmetic complexity of one
iteration of the algorithm is of the same order as the complexity of
solving an initial-boundary value problem on the fine grid.

4.3  A waveform relaxation algorithm

Spatial discretization of (3.1.a-b) and (4.1) leads to the following
system of ordinary differential equations,

$$\frac{dU}{dt} + L(U) = F, \quad U(t_0) = U(t_f). \qquad (4.5)$$

This system may be solved with a waveform relaxation algorithm that is
only slightly different from the algorithm discussed in section 2.
Instead of repeatedly solving an ordinary differential equation of
initial value type at each grid point, one repeatedly solves the
following periodic differential equation

$$\frac{d}{dt}U_{ij} + (L(U))_{ij} = f_{ij}, \quad U_{ij}(t_0) = U_{ij}(t_f). \qquad (4.6)$$

This problem may be solved e.g. by a discretization method, resulting
in a sparse matrix equation. Application of an implicit one-step
discretization method leads to an easily solvable, almost bidiagonal
matrix equation.
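The almost bidiagonal structure can be made concrete for a scalar model equation u' + a u = g(t) with u(t_0) = u(t_f). The sketch below assumes backward Euler on nt equal steps and a > 0; the function name and the elimination strategy are choices made for this illustration. The discrete system has the diagonal 1/dt + a, the subdiagonal -1/dt, and one extra entry -1/dt in the upper right corner that expresses periodicity.

```python
import numpy as np

def solve_periodic_scalar(a, g, dt):
    """Solve (u_j - u_{j-1})/dt + a*u_j = g_j, j = 1..nt, with u_0 = u_nt.

    g holds g_1..g_nt; the return value holds u_1..u_nt. Requires a > 0.
    """
    g = np.asarray(g, dtype=float)
    nt = len(g)
    c = 1.0 / (1.0 + a * dt)
    # Forward sweep from a zero start value gives the particular part s_j,
    # so that u_j = c**j * u_0 + s_j for any start value u_0.
    s = np.empty(nt)
    acc = 0.0
    for j in range(nt):
        acc = c * (acc + dt * g[j])
        s[j] = acc
    # Periodicity u_nt = u_0 fixes the start value.
    u0 = s[-1] / (1.0 - c**nt)
    return u0 * c ** np.arange(1, nt + 1) + s
```

The cost is one forward sweep plus one scaled update, so the corner entry adds little work compared with an initial value problem of the same length.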

The  modified  waveform  relaxation  can  be  used 
as  such,  or  can  be  integrated  as  a  smoother  into  any 
of  the  multigrid  schemes  of  section  3.  Numerical 
evidence  shows  that  the  latter  leads  to  a  rapidly 
converging  iteration,  with  typical  multigrid  con¬ 
vergence  rates. 


5.  Implementation  aspects 
5.1  Parallelization 

We have implemented the WR algorithms on an Intel iPSC/2-VX hypercube.
For a description of this multiprocessor, its hardware characteristics
and various performance benchmarks, we refer to [1]. The
implementation is discussed in great detail in our studies [11,12,13].
In this paper we will only go over some of the main issues.

A  classical  data  decomposition  is  used  to  evenly 
distribute  the  computational  workload.  The  pro¬ 
cessors  are  arranged  in  a  rectangular  array  and  are 
mapped  onto  the  domain  of  the  partial  differential 
equation.  Each  processor  is  responsible  for  doing  all 
computations  on  the  grid  points  in  its  part  of  the 
physical  domain.  During  the  computation,  com¬ 
munication  with  neighboring  processors  is  needed 
to update local boundary values. Various other communication
strategies may further be used to improve the parallel performance. We
want to mention in particular the use of an agglomeration strategy to
reduce the communication complexity of the coarse grid operations
[11].

In the WR method each grid point is associated with an unknown function, $U_{ij}(t)$. In our implementation, such a function is represented as a vector of function values evaluated at equidistant time levels. We denote the vector length by $n_t$ (number of time intervals). The arithmetic complexity increases linearly with the value of $n_t$. In the same way, the total length of the messages exchanged during the computation is proportional to the vector length. The number of message exchanges, however, and the sequential overhead due to program control are independent of $n_t$. From the high message startup time on most parallel machines (and in particular on the iPSC/2), it is clear that the communication time to calculation time ratio will decrease with increasing function length and that the parallel efficiency will improve.
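
The dependence of this ratio on the vector length can be illustrated with a minimal cost model. The sketch below is not taken from the implementation discussed here: the function name and every parameter value (message count, startup time, per-word transfer time, operation count, time per floating point operation) are hypothetical and serve only to show the trend.

```python
def comm_to_calc_ratio(nt, n_msgs=100, t_startup=350e-6, t_word=1.4e-6,
                       n_ops=5000, t_flop=5e-6):
    """Communication/calculation time ratio for waveforms of length nt.

    All default values are illustrative assumptions, not measurements.
    """
    # Each message pays the startup cost once, plus a cost proportional to nt.
    comm = n_msgs * (t_startup + nt * t_word)
    # The arithmetic work grows linearly with the vector length.
    calc = n_ops * nt * t_flop
    return comm / calc


# The ratio shrinks as the vector length grows, because the startup
# cost is amortized over more function values.
ratios = [comm_to_calc_ratio(nt) for nt in (1, 10, 50, 100)]
```

In this model the startup term contributes a share proportional to 1/nt, which is why longer waveforms improve the parallel efficiency.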

In figure 5.1, we present typical speedup values (Sp) measured on a 16 processor machine. Two curves are drawn, one for a waveform multigrid cycle, with $n_t = 50$, and one for a standard multigrid cycle, as it is used for solving elliptic problems. (The particular problem that is solved is discussed in [12].) The substantial performance difference is due to the very different calculation to communication ratios.

Figure 5.1. Typical speedups for waveform multigrid and standard multigrid cycles

5.2  Vectorization 

The use of a vector processor in each computer
node  may  result  in  a  substantial  reduction  of  com¬ 
puting  time.  Indeed,  most  of  the  waveform  mul¬ 
tigrid  operations  can  be  expressed  as  simple  arith¬ 
metic  operations  on  functions,  i.e.  on  vectors,  see 
e.g.  the  restriction  operator  (3.7).  In  contrast  to 
the  standard  approach  we  do  not  vectorize  in  the 
spatial  direction  but  we  vectorize  in  the  time  direc¬ 
tion.  The  vector  speedup  of  the  arithmetic  part  of 
the  computation  will  mainly  depend  on  the  value 
of the vector length parameter $n_t$. It will be virtually independent of the size of the spatial grid, the
number  of  multigrid  levels,  the  multigrid  cycle 
used  and  the  number  of  processors.  This  is  in 
sharp  contrast  with  standard  multigrid  vectoriza¬ 
tion  results,  see  e.g.  [6].  Standard  vectorization 
does  not  lead  to  a  performance  improvement  unless 
the  number  of  grid  points  per  processor  is  very 
large.  Its  application  is  therefore  of  very  limited 
use  on  large  scale  parallel  processors. 

As  a  second  advantage  of  our  approach  we  may 
mention  the  ease  of  implementation.  As  the  vector 
operations at each grid point involve the vectors
at  neighboring  grid  points  only,  no  complex  grid 
restructuring  (as  in  the  standard  approach)  is 
needed. 

The only operation which is not perfectly vectorizable is the core of the ODE integrator, which is used in the smoothing step, and which is inherently sequential. It will therefore reduce the possible gain through vectorization. It can be shown that the non-vectorizable part of the calculation makes up at most 10% of the total computation. This leads to a possible vector speedup of 10 or more.
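
The bound quoted above follows from an Amdahl-type argument: if a fraction s of the work stays scalar, the overall speedup cannot exceed 1/s. A minimal sketch, with a hypothetical function name and illustrative gain values:

```python
def vector_speedup(scalar_fraction, vector_gain):
    """Overall speedup when only the vectorizable part runs vector_gain times faster."""
    return 1.0 / (scalar_fraction + (1.0 - scalar_fraction) / vector_gain)


# With a non-vectorizable part of 10%, the limit for an infinitely fast
# vector unit is 1 / 0.10 = 10; a smaller scalar part gives a higher limit.
bound = vector_speedup(0.10, float("inf"))
realistic = vector_speedup(0.10, 4.0)   # assumed gain of 4 on the vector part
```

Since 10% is an upper limit on the scalar part, 10 is a lower limit on the attainable bound, which matches the phrase "10 or more".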


6.  Numerical  examples 


6.1  An  initial-boundary  value  problem 

We  consider  the  solution  of  the  following 
initial-boundary  value  problem. 


dt  dx^  dxdy  ay2 


(6.1) 


defined on $\Omega = [0,1] \times [0,1]$ for $t \in [0, 0.5]$, with
Dirichlet  conditions  on  the  northern,  eastern  and 
southern  boundary  and  a  Neumann  condition  on 
the  boundary  to  the  west.  The  right  hand  side 
function  f  is  chosen  in  such  a  way  that  the  solution 
is  equal  to 

u(t.x,y)  =  sin(5x+y+10t)  e”^*. 


For this problem we will compare the performance of the WR method with a parallel implementation of the "best" sequential method, the Crank-Nicolson method. Our implementations of both methods are highly optimized and are of similar complexity. In both cases multigrid is used with a four-color nine-point Gauss-Seidel smoother, standard coarsening to a 3 by 3 coarse grid, full weighting restriction, bilinear interpolation and a coarse grid solver that performs 2 Gauss-Seidel iterations. A constant time step, $\Delta t$, is chosen for the Crank-Nicolson method, similar to the time step used to represent the functions in the WR method. In this example, $\Delta t$ was set to 0.01, independent of the spatial discretization. This leads to a vector length of 50. In the WR method we use the trapezoidal rule to solve the differential equations.


Some  results,  obtained  on  a  16  processor 
machine,  are  depicted  in  figure  6.1.  The  graphs 
show  the  accuracy  of  the  solution  (largest  error  at 
the  grid-points)  versus  execution  time.  The  figures 
show  smooth  curves  for  the  WR  method.  The  error 
of  the  initial  waveform  approximation  gradually 
decreases  as  more  and  more  multigrid  cycles  are 
applied.  The  Crank-Nicolson  results  show  up  as 
discrete  points.  The  Crank-Nicolson  solution  pro¬ 
cess  is  advanced  time  step  per  time  step  in  a  total 
of  t  seconds.  The  accuracy  of  the  result  is 
represented  by  a  "+"  -sign  at  position  (t,error)  in 
the  figure. 

Figure 6.1. Comparison of Crank-Nicolson and WR Multigrid execution times

Table 6.1: Execution time, in seconds, of the full multigrid solver on 1 and 16 processors

17 by 17 problem, 33 by 33 problem

method                  1 proc.   16 proc.   Sp
Waveform Relaxation     36.00     4.30       8.4
Crank-Nicolson          58.72     14.23      4.1

Table 6.2: Execution time, in seconds, of the WR FMG V(1,1) solver (on 16 processors)

        17 by 17 problem         33 by 33 problem         65 by 65 problem
n_t     scalar  vector  Sp       scalar  vector  Sp       scalar  vector  Sp
100     3.54    1.12    3.16     8.20    2.42    3.39     (na)    (na)    (na)
 50     1.91    0.76    2.51     4.34    1.59    2.73     11.88   4.02    2.96
 25     1.10    0.59    1.86     2.43    1.20    2.03     6.53    2.91    2.24
 10     0.62    0.49    1.27     1.29    0.97    1.33     3.21    2.25    1.43

Depending on the cycle type used, different execution times are needed. As such, several results are presented for each technique. They are annotated in figure 6.1 in the following way: with "WR F(1,1) with FMG" we mean "waveform relaxation using F-cycles with 1 pre- and 1 post-smoothing step and the full multigrid technique with 1 cycle at each grid level"; with "C-N, 2 F(1,1)" we mean "Crank-Nicolson method with 2 F(1,1) cycles per time step". Two sets of curves are given for the WR method. The dashed lines represent the results obtained with vectorization, while the solid lines represent the results obtained in scalar execution mode.

On  16  processors,  WR  turns  out  to  be  faster 
than  the  Crank-Nicolson  method  by  a  factor  of  8 
(for  the  65  by  65  problem)  up  to  a  factor  of  10 
(for  the  17  by  17  problem).  This  is  due  to  the 
smaller  arithmetic  complexity  of  the  waveform 
method,  its  superior  parallel  characteristics  and  the 
use  of  vectorization. 

In table 6.1 we have tabulated the execution time of the full multigrid solver with one V(1,1) cycle on each grid level, on 1 and on 16 processors. We have also added the parallel speedup, Sp. Waveform relaxation outperforms the standard method by a factor of approximately 1.7 on a single processor. This is due to the smaller arithmetic complexities of the smoothing and the defect calculation steps (which account for a factor of approximately 1.5), a reduced initialization cost (some intermediate results may be retained when setting up system (3.2)) and the lower computational overheads associated with program control (loop overhead, indexing overhead, procedure call overhead, etc.). An additional factor of 2, and higher, results from the better parallel characteristics of the WR method. This is easily seen from the speedup figures.

The  remaining  performance  difference  is  due  to 
vectorization.  In  table  6.2  we  give  the  execution 
times  of  the  WR  full  multigrid  solver.  The  depen¬ 
dence  of  the  vector  speedup  on  the  vector  length  is 
obvious.  It  should  be  noted  that  for  problems  of 
this  size  on  a  16  processor  machine,  standard  vec¬ 
torization  in  the  Crank-Nicolson  method  would 
not  lead  to  any  speedup  [6]. 


6.2  A time-periodic problem


We  consider  the  parabolic  partial  differential 
equation. 


au  _ 
at  di^ 


(6.2) 


with the following time periodicity condition, $u(0,x,y) = u(1,x,y)$, defined on the unit square
with  four  Dirichlet  boundary  conditions.  The 
function  f  is  chosen  such  that  the  solution  of  the 
PDE  equals 

u(t.x.y)  =  (x— x*)^(y— y^)^sin(2iTt) 


In figure 6.2 we represent the timing results obtained on a 16 processor hypercube. (No vectorization was used for this example.)

Figure 6.2. Comparison of waveform relaxation method and multigrid method of the second kind


Three  methods  are  compared.  The  first  method 
is  a  parallel  implementation  of  the  multigrid 
method of the second kind, as proposed by Hackbusch. The second order backward differentiation
method  (BDF(2))  is  used  for  time  integration.  It 
has  excellent  smoothing  properties  and  is  of  high 
accuracy. The mesh size on each grid $G_l$, $\Delta x_l$, is determined by standard coarsening from a fine grid, G^, with 65, 33, or 17 grid lines in x- and y-direction. The time increment, $\Delta t_l$, is chosen equal
to  the  mesh  size.  The  linear  systems  obtained  by 
the  BDF(2)  scheme  on  each  time-level  are  solved 
by  using  the  standard  multigrid  method,  with  the 
2 V(1,1) cycles or full multigrid. Thanks to its
$O((\Delta x)^2)$ convergence rate, one iteration of the


method  is  sufficient  to  solve  the  problem  to  discret¬ 
ization  accuracy. 

Various programming techniques are applied to optimize the parallel performance of the implementation. In particular, an agglomeration technique is used to reduce the parallel overhead of the coarse grid operations [11].

A related method is obtained if the multigrid WR algorithm is used as the smoother inside the multigrid method of the second kind. The resulting algorithm is between 2 and 3 times as fast as the method with time stepping, as can be seen in fig. 6.2. This was to be expected from the results of section 6.1.

The periodic multigrid WR method proves to be
faster  than  the  best  standard  algorithm,  by  a  fac¬ 
tor  of  7  to  10.  This  is  in  part  due  to  its  lower  arith¬ 
metic  complexity.  If  the  complexity  of  solving  the 
initial-value  problem  (3.1.a-c)  on  the  fine  grid 
is  denoted  by  W^,  it  may  be  shown  that  the  cost  of 
the  multigrid  algorithm  of  the  second  kind  is 
approximately 2.5 W^, whereas the execution of a
full  multigrid  WR  step  is  only  1  W^.  The  remain¬ 
ing  performance  difference  results  from  the  better 
parallel  characteristics  of  the  WR  method.  As  was 
noted  in  the  introduction,  it  is  difficult  to 
efficiently  parallelize  coarse  grid  operations.  The 
multigrid  method  of  the  second  kind  visits  the 
coarse  grid  very  frequently,  because  of  its  "double 
multigrid"  nature.  It  is  basically  a  multigrid  W- 
cycle,  where,  in  each  smoothing  step,  a  large 
number  of  elliptic  problems  are  solved  by  standard 
multigrid.  Consequently,  the  algorithm  is  not  well 
suited  for  parallel  implementation. 

We should also note that vectorization will lead to an additional speedup, in the case of the WR algorithm only. The performance difference on the 16 processor machine will then be in the range of 25 to 50, depending on the problem size.

7.  Concluding  remarks 

The  transformation  of  the  parabolic  problem 
into  the  sequential  process  of  solving  small  prob¬ 
lems  on  successive  time  levels,  seriously  degrades 
parallel  efficiency  of  the  standard  marching 
schemes.  While  they  can  be  used  efficiently  for 
problems  with  a  very  large  number  of  grid  points 
per  processor,  they  perform  totally  unsatisfac¬ 
torily  for  small  problems  and  large  numbers  of 
processors. 

We  have  presented  several  methods  based  on 
waveform  relaxation.  They  show  multigrid  con¬ 
vergence  speeds  and  can  be  efficiently  implemented 
on  parallel  machines.  As  an  added  advantage  they 
can  be  straightforwardly  vectorized  even  if  the 
number  of  grid  points  per  processor  is  very  small. 
As  such  they  are  perfectly  fit  for  implementation 
on  massively  parallel  machines. 

Abstract

The  DC  analysis  and  transient  analysis  parts 
of  the  general  electrical  circuit  analysis  pro¬ 
gram  ESACAP  are  parallelized  to  run  on  the 
Intel  iPSC.  Most  of  the  program  runs  un¬ 
changed  on  the  cube  manager.  Only  the  solu¬ 
tion  of  the  systems  of  nonlinear  algebraic  equa¬ 
tions  is  parallelized  to  run  on  the  hypercube 
parallel  computer.  The  nonlinear  equations 
arise  either  from  the  DC  problem  or  from  the 
discretization  of  the  differential  equations  by 
backward  differentiation  formulas. 

Circuit  equations  are  allocated  to  processors 
to  balance  the  load  of  function  evaluation  and 
LU-factorization  algorithm  used  by  the  New¬ 
ton  iteration  algorithm.  Using  p  processors, 
a  speed-up  of  approximately  p/2  can  be  ob¬ 
tained.  The  performance  of  the  parallel  pro¬ 
gram  is  demonstrated  by  simulating  a  4  x  4  bit 
digital  multiplier  circuit  leading  to  180  circuit 
equations. 

1  Techniques  for  Parallel 
Circuit  Analysis 

The  time  domain  analysis  of  analogue  circuits 
or  digital  circuits  at  the  circuit  level  involves 
the  numerical  solution  of  systems  of  nonlinear 
ordinary  differential  equations.  The  differen¬ 
tial  equations  are  usually  stiff  which  implies 
that  implicit  numerical  integration  formulas 
must  be  used.  Besides,  models  of  electrical 
circuits  often  lead  to  coupled  systems  of  dif¬ 
ferential  and  algebraic  equations,  and  conse¬ 
quently  each  time  step  involves  the  solution 
of  a  system  of  nonlinear  algebraic  equations 
including  the  discretization  of  the  differential 
equations. 

Let a circuit be described by the following implicitly given system of differential algebraic equations,

$f(t, y, y') = 0$  (1)

where $f : \mathbb{R} \times \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}^N$. When discretized by the simple backward Euler formula, the following nonlinear algebraic system is obtained,

$f(t_n, y_n, (y_n - y_{n-1})/h) = 0$  (2)

where $h$ is the stepsize in time, $h = t_n - t_{n-1}$ and $y_n \approx y(t_n)$.

The solution of (2) is obtained by some iterative method, usually of Newton type,

$y_n^{(m+1)} = y_n^{(m)} - F_n'(y_n^{(m)})^{-1} f_n(y_n^{(m)})$  (3)

where $f_n(y) = f(t_n, y, (y - y_{n-1})/h)$ and $F_n'(y) = \partial f_n(y)/\partial y$.
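
As an illustration of how the backward Euler system (2) is solved by the Newton-type iteration (3), the following sketch performs one time step on a small differential algebraic test problem. It is not ESACAP code: the function names, the test problem, the tolerance and the use of a dense solver are assumptions made for this example only.

```python
import numpy as np


def backward_euler_step(f, jac, t_n, y_prev, h, tol=1e-12, max_iter=20):
    """Solve f(t_n, y, (y - y_prev)/h) = 0 for y by Newton iteration."""
    y = y_prev.copy()                    # previous solution as starting value
    for _ in range(max_iter):
        yp = (y - y_prev) / h
        residual = f(t_n, y, yp)
        if np.linalg.norm(residual) < tol:
            break
        dfdy, dfdyp = jac(t_n, y, yp)
        big_f = dfdy + dfdyp / h         # Jacobian of f_n: df/dy + (1/h) df/dy'
        y = y - np.linalg.solve(big_f, residual)
    return y


# Hypothetical test problem: one differential and one algebraic equation.
def f(t, y, yp):
    return np.array([yp[0] + 50.0 * y[0] - y[1], y[1] - y[0] ** 2])


def jac(t, y, yp):
    dfdy = np.array([[50.0, -1.0], [-2.0 * y[0], 1.0]])
    dfdyp = np.array([[1.0, 0.0], [0.0, 0.0]])
    return dfdy, dfdyp


h = 0.01
y_prev = np.array([1.0, 1.0])
y_new = backward_euler_step(f, jac, h, y_prev, h)
```

A dense `np.linalg.solve` stands in here for the sparse LU-factorization that a circuit simulator would use for the same linear system.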

The main computational task involved in the Newton iteration is the evaluation of the nonlinear vector function $f$ and the nonlinear matrix function $F'$. The matrix $F'$ is generally sparse, and this is exploited in the solution of the linear equations indicated by the matrix inverse. The computational complexity of the linear equation solution is proportional to $N^k$ where $k$ is in the interval from 2 to 3 and $N$ is the dimension of the problem (1). The solution of the linear equations therefore becomes relatively larger compared with the evaluation of $f$ and $F'$ when $N$ increases.

The obvious approach to a parallel version, called direct parallelization, is to compute $f$ and $F'$ in parallel and solve the linear equations in parallel. This simply amounts to a parallel version of the Newton iteration (3).
The  vector  and  matrix  functions  parallelize 
well,  but  the  general  sparse  system  of  linear 
equations  is  hard  to  solve  efficiently  on  a  par¬ 
allel  computer. 

An alternative approach to a parallel numerical solution of (1) is the so-called waveform relaxation [1],[2]. In this approach, the system (1) is decomposed into a number
of  loosely  coupled  subsystems  of  differential 
equations,  and  each  subsystem  is  solved  inde¬ 
pendently  over  the  same  time  window.  The 
process  is  repeated  a  number  of  times,  and 
after  each  iteration,  information  is  exchanged 
between  the  subsystems. 
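
The iteration described above can be sketched for a small linear test system in which every scalar equation forms its own subsystem (a Gauss-Jacobi splitting). This is only an illustration under stated assumptions: the function name, the test matrix, the backward Euler discretization inside each subsystem and the fixed number of sweeps are choices made for this example and are not taken from [1] or [2].

```python
import numpy as np


def waveform_relaxation(a, y0, h, n_steps, sweeps):
    """Gauss-Jacobi waveform relaxation for y' = a @ y on one time window."""
    n = len(y0)
    # Initial guess: waveforms that stay constant at the initial value.
    w = np.tile(np.asarray(y0, dtype=float), (n_steps + 1, 1))
    for _ in range(sweeps):
        new = w.copy()
        for i in range(n):               # subsystems are independent within a sweep
            for k in range(1, n_steps + 1):
                # Coupling uses the waveforms of the previous sweep.
                coupling = sum(a[i, j] * w[k, j] for j in range(n) if j != i)
                # Backward Euler for y_i' = a_ii * y_i + coupling.
                new[k, i] = (new[k - 1, i] + h * coupling) / (1.0 - h * a[i, i])
        w = new                          # exchange information between subsystems
    return w


a = np.array([[-1.0, 0.1], [0.2, -2.0]])   # weakly coupled test system
y0 = [1.0, 1.0]
h, n_steps = 0.01, 50
w = waveform_relaxation(a, y0, h, n_steps, sweeps=20)
```

For this weakly coupled system the waveforms converge to the solution that backward Euler would give for the fully coupled system; stronger coupling would require more sweeps.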

Waveform  relaxation  was  not  intended  pri¬ 
marily  for  parallel  computers,  but  it  was  ob¬ 
served  in  [2]  that  the  method  maps  well  on 
medium  grain  distributed  memory  computers, 
like  the  Intel  iPSC  Hypercube.  Waveform  re¬ 
laxation  is  very  efficient  for  the  simulation 
of  digital  MOS  circuits,  and  it  has  been  at¬ 
tempted  for  a  broader  class  of  problems.  How¬ 
ever,  the  same  gain  in  efficiency  is  not  always 
obtained  and  there  may  be  problems  with  con¬ 
vergence. 

In  the  parallel  version  of  waveform  relax¬ 
ation,  one  or  several  subsystems  are  allocated 
to  each  processor.  For  certain  problems,  the 
subsystems  become  so  large  that  it  is  neces¬ 
sary  to  apply  several  processors  to  solve  one 
subsystem.  Otherwise,  the  result  would  be 
very  poor  load  balance.  In  this  case,  the  di¬ 
rect  parallelization  approach  can  be  applied  to 
a  subsystem. 

This  paper  exploits  the  possibilities  of  di¬ 
rect  parallelization  for  the  following  reasons. 
If  an  efficient  approach  can  be  devised,  it  can 
be  used  in  the  parallelization  of  numerous  ex¬ 
isting  simulation  programs  like  ESACAP.  The 
balance  between  computations  which  paral¬ 
lelize  well  and  computations  which  do  not  par¬ 
allelize  well  may  turn  out  to  be  favourable, 
leading  to  a  good  overall  speed-up.  Last,  di¬ 
rect  parallelization  of  large  blocks  in  a  wave¬ 
form  relaxation  program  can  extend  the  class 
of  problems  for  which  this  very  powerful  ap¬ 
proach  is  useful. 

2  ESACAP 

ESACAP  [3]  is  a  general  purpose  circuit  analy¬ 
sis  program  primarily  developed  for  DC,  tran¬ 
sient  and  periodic  steady-state  analysis  of 
nonlinear  electrical  circuits.  For  linear  cir¬ 
cuits,  it  also  offers  frequency  response  and 
zero/pole  computation.  The  circuits  are  de¬ 
scribed  in  a  flexible  input  language  which  per¬ 
mits  the  specification  of  general  nonlinear  rela¬ 
tionships.  The  Jacobian  F'  is  computed  ana¬ 
lytically  by  ESACAP  on  the  basis  of  the  input 


specification. 

The  circuit  description  is  transformed  into 
modified  nodal  equations  [4]  which  are  re¬ 
ordered  to  fill  diagonal  zeros.  Then  the  Ja¬ 
cobian  F'  is  reordered  to  reduce  fill-ins  dur¬ 
ing  LU-factorization.  The  DC  problem  is 
solved  by  a  hybrid  method  [5],  and  the  dif¬ 
ferential  equations  are  integrated  numerically 
by  backward  differentiation  formulas  automat¬ 
ically  selected  from  orders  1  to  6. 

Figure  1  shows  an  example  of  input  for 
ESACAP,  a  full  adder  realized  with  transmis¬ 
sion  gates  [6].  The  circuit  description  is  hierar¬ 
chical  with  the  transistor  model  at  the  lowest 
level,  then  the  gates  and  finally  the  adder.  The 
transistors  can  be  modelled  according  to  the 
needs,  but  usually  a  suitable  model  is  found 
in  a  library. 

The  problem  used  for  benchmarks  through¬ 
out  this  paper  is  a  corrected  version  of  the  mul¬ 
tiplier  found  on  p.  345  in  [6].  It  is  composed  of 
full  adders  as  specified  in  Figure  1,  half  adders 
trivially derived from the full adder and simple
C-MOS  gates. 

3  Parallel  ESACAP  -  Out¬ 
line 

In  this  project,  only  the  numerical  solution 
of  ordinary  differential  equations  (transient 
analysis)  has  been  chosen  for  parallelization. 
The  backward  differentiation  formulas  used  in 
ESACAP do not permit parallelization "across the method" but only parallelization "across the system" [7]. This means that the system
of  equations  (1)  is  partitioned  into  groups  of 
equations  where  each  group  is  assigned  to  a 
different  processor. 

The  original  version  of  ESACAP  runs  on  the 
iPSC  Cube  Manager.  In  order  to  simplify  the 
modifications,  it  was  decided  to  confine  the 
parallelization  to  the  subroutine  implementing 
Newton  iteration  (3).  The  original  Newton  it¬ 
eration  performed  for  each  time  step  of  the 
numerical  integration  is  substituted  by  a  sub¬ 
routine  providing  data  for  the  32  node  iPSC 
which  runs  a  parallel  Newton  iteration  loop. 

The  iteration  scheme  (3)  is  implemented  in 
three  different  versions.  The  first  version  is  a 
true  Newton  iteration  as  specified  by  (3).  The 
second  version  is  a  pseudo  Newton  iteration 
where $F_n'(y_n^{(0)})$ is used through all iterations

$$DES
# AUXILIARY FUNCTIONS
$FUNCTION: RAMP(X,A,B,G,E);
ARG=EXP(A*(X-G));
RAMP=B*((ARG-1)/(ARG+1))+E;
END;
$FUNCTION: AF1(UGS,UT,APHI);
AF1=RAMP(UGS,8/APHI,1/3,
UT-APHI/4,1/3);
END;
$FUNCTION: AF2G1(UGS,UT);
AF2G1=RAMP(UGS,10,1/4,UT,1/4);
END;
$FUNCTION:
CGS3(UGS,UDS,UT,COX,APHI,CGSO);
CGS3=CGSO+COX*IFGT(UDS,0,
AF1(UGS,UT,APHI),AF2G1(UGS,UT));
# IFGT:
# IF UDS>0 THEN AF1(.. ELSE AF2G1(..
END;
$FUNCTION: CGD3(UGS,UDS,UT,COX,CGDO);
CGD3=CGDO+COX*IFLT(UDS,0,
AF2G1(UGS,UT),0);
# IFLT:
# IF UDS<0 THEN AF2G1(.. ELSE 0
END;
$FUNCTION: CGB3(UGB,UT,COX,APHI,CGBO);
U1=UT-APHI/2;
CGB3=CGBO+
COX*RAMP(UGB,-1,0.425,U1,0.426);
END;
#
$FUNCTION:
IDS2U3(GSEFF,DS,BETA,LAMBD);
IDS2U3=IFLT(GSEFF,0,0,IFLT(DS,GSEFF,
2*BETA*(GSEFF-DS/2)*DS,
BETA*GSEFF*GSEFF)*(1+LAMBD*DS));
END;
# MOS TRANSISTOR MODEL
# FIRST LEVEL SPICE MODEL
$MODEL: MOS2U3(DRAIN,GATE,SOURCE):
NPTYP,UT,BETA,GAMMA,LAMBD,APHI,BS,
CGDO,CGSO,CGBO,COX;
# DEFAULT PARAMETERS
DEF(NPTYP=1,UT=1.0,BETA=4.8U,
GAMMA=0.206,LAMBD=0.03,APHI=0.636,
BS=0,CGDO=1F,CGSO=1F,CGBO=1F,COX=1F);
# THRESHOLD VOLTAGE
UTR=UT*NPTYP+
GAMMA*(SQRT(NPTYP*APHI-NPTYP*BS)-
SQRT(NPTYP*APHI));
# DRAIN-SOURCE CURRENT
JDS(DRAIN,SOURCE)=NPTYP*IDS2U3(
NPTYP*V(GATE,SOURCE)-UTR,
NPTYP*V(DRAIN,SOURCE),BETA,LAMBD);
# CAPACITANCES
CGADR(GATE,DRAIN)=CGD3(
NPTYP*V(GATE,SOURCE),
NPTYP*V(DRAIN,SOURCE),UTR,COX,CGDO);
CGASO(GATE,SOURCE)=
CGS3(NPTYP*V(GATE,SOURCE),
NPTYP*V(DRAIN,SOURCE),
UTR,COX,APHI,CGSO)+
CGB3(NPTYP*(V(GATE,SOURCE)-BS),
UTR,COX,APHI,CGBO);
END;
# C-MOS INVERTER
$MODEL: INV(IN,OUT,REF,UDD);
X1(OUT,IN,REF)=MOS2U3(UT=0.9,BETA=40U,
GAMMA=0.3,LAMBD=0.05,APHI=0.6,
CGDO=9F,CGSO=9F,CGBO=9F,COX=16F);
X2(OUT,IN,UDD)=MOS2U3(NPTYP=-1,UT=-0.9,
BETA=40U,GAMMA=.4,LAMBD=.05,APHI=-0.6,
CGDO=9F,CGSO=9F,CGBO=9F,COX=16F);
END;
# C-MOS TRANSMISSION GATE
$MODEL: TGATE(IN,OUT,T,NT);
X1(IN,T,OUT)=MOS2U3(UT=0.9,BETA=40U,
GAMMA=0.3,LAMBD=0.06,APHI=0.6,
CGDO=9F,CGSO=9F,CGBO=9F,COX=16F);
X2(IN,NT,OUT)=MOS2U3(NPTYP=-1,UT=-0.9,
BETA=40U,GAMMA=.4,LAMBD=.06,APHI=-0.6,
CGDO=9F,CGSO=9F,CGBO=9F,COX=16F);
END;
# C-MOS FULL ADDER
$MODEL: ADDER(A,B,C,SUM,CARRY,UDD);
X1(A,NA,NREF,UDD)=INV;
X2(B,APB,NA,A)=INV;
X3(B,NAPB,A,NA)=INV;
X4(B,APB,NA,A)=TGATE;
X5(B,NAPB,A,NA)=TGATE;
X6(C,NC,NREF,UDD)=INV;
X7(NC,NSUM,NAPB,APB)=TGATE;
X8(C,NSUM,APB,NAPB)=TGATE;
X9(NSUM,SUM,NREF,UDD)=INV;
X10(NB,NCARRY,NAPB,APB)=TGATE;
X11(NC,NCARRY,APB,NAPB)=TGATE;
X12(NCARRY,CARRY,NREF,UDD)=INV;
X13(B,NB,NREF,UDD)=INV;
END;
$$STOP

Figure 1: ESACAP description of a transmission gate adder.


for $y_n$. The last version tries to use the same Jacobian for the computation of $y_n, y_{n+1}, \ldots$ as far as possible.

If  the  Newton  type  iterations  fail  to  con¬ 
verge,  the  program  falls  back  on  a  parallel 
version of the hybrid method [5] primarily designed for DC analysis. This means that parallel DC analysis comes "free".

A number of circuit equations (components of the vector function $f$) and the corresponding rows of the Jacobian $F'$ are allocated to
each  processor.  The  factorization  of  the  lin¬ 
ear  equations  is  based  on  the  same  processor 
allocation.  The  pivot  rows  are  broadcast  to 
all  processors  one  by  one,  and  the  elimination 
takes  place  in  parallel. 

The  parallelization  of  ESACAP  served  three 
purposes.  First,  to  gain  experience  with  the 
porting  of  large  sequential  programs  to  a  par¬ 
allel  computer.  Only  a  small  part  of  ESACAP 
had  to  be  modified  and  other  parts  rearranged, 
and  this  is  believed  to  be  a  typical  situation. 
The  parallelization  also  involved  the  splitting 
of  data  structures,  and  in  the  case  of  ESACAP 
this  task  was  nontrivial.  Second,  to  measure 
the  overall  speed-up  of  a  parallel  nonlinear  in¬ 
tegration  routine  for  stiff  systems  of  ordinary 
differential  equations.  It  is  very  difficult  to 
solve  sparse  linear  equations  efficiently  in  par¬ 
allel,  but  the  overall  speed-up  is  still  accept¬ 
able  when  this  part  is  non-dominant. 

Finally,  the  purpose  was  to  get  a  test  vehicle 
for  testing  various  parallel  sparse  linear  equa¬ 
tion  solvers.  The  powerful  input  language  of 
ESACAP  permits  the  specification  of  models 
of  problems  from  a  variety  of  areas  to  generate 
test  problems  for  the  equation  solver. 

4  Functions  and  Jacobian 

The  functions  specified  in  the  input  language 
of  ESACAP  are  converted  into  an  internal 
form  based  on  reverse  Polish  notation.  The 
vector function $f_n$ of (3) is computed from this internal representation, and the derivatives required for $F_n'$ are also computed analytically from the same representation.

The  derivatives  are  computed  for  the  ac¬ 
tual  vector;  no  symbolic  representation  of 
the  derivatives  exists.  This  approach  leads  to 
less  efficient  evaluation  of  the  nonlinear  mod¬ 
els  than  the  traditional  approach  where  the 
nonlinear  functions  and  derivatives  are  imple¬ 


mented  as  Fortran  subroutines.  However,  it 
gives  the  user  the  ultimate  freedom  in  spec¬ 
ifying  nonlinear  relations,  and  besides  it  im¬ 
proves  the  potential  of  speed-up  from  paral¬ 
lelization,  because  it  shifts  the  computational 
complexity  from  the  solution  of  linear  equa¬ 
tions  towards  the  computation  of  nonlinear 
functions. 
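
The idea of evaluating a function and its derivative numerically from a reverse Polish representation, without building a symbolic derivative, can be sketched as follows. The token format and the function `eval_rpn` are hypothetical and do not describe the internal form used by ESACAP; the example expression is the RAMP function of Figure 1, with parameter values chosen for illustration.

```python
import math


def eval_rpn(tokens, env, wrt):
    """Evaluate an RPN token list; return (value, derivative w.r.t. env[wrt])."""
    stack = []
    for tok in tokens:
        if tok in ("+", "-", "*", "/"):
            (b, db), (a, da) = stack.pop(), stack.pop()
            if tok == "+":
                stack.append((a + b, da + db))
            elif tok == "-":
                stack.append((a - b, da - db))
            elif tok == "*":
                stack.append((a * b, da * b + a * db))        # product rule
            else:
                stack.append((a / b, (da * b - a * db) / (b * b)))  # quotient rule
        elif tok == "exp":
            a, da = stack.pop()
            e = math.exp(a)
            stack.append((e, e * da))                         # chain rule
        elif isinstance(tok, str):
            stack.append((env[tok], 1.0 if tok == wrt else 0.0))   # variable
        else:
            stack.append((float(tok), 0.0))                   # numeric constant
    (result,) = stack
    return result


# RAMP = B*((ARG-1)/(ARG+1)) + E with ARG = EXP(A*(X-G)), written in RPN.
arg = ["A", "X", "G", "-", "*", "exp"]
ramp = ["B"] + arg + [1, "-"] + arg + [1, "+", "/", "*", "E", "+"]
env = {"X": 1.0, "A": 10.0, "B": 0.25, "G": 1.0, "E": 0.25}
value, dvalue_dx = eval_rpn(ramp, env, wrt="X")
```

Each stack entry carries a value together with its derivative for the actual argument vector, so the cost of a Jacobian entry is comparable to the cost of the function evaluation itself.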

The fundamental building blocks in the input language of ESACAP are two-terminal elements, and the circuit modelled by these elements is represented by modified node equations [4]. This means that each two-terminal element in general appears in two equations or, if one terminal is grounded, in only one equation. If two node equations including the same nonlinear two-terminal element are allocated to two different processors, this results in a duplicate computation of the corresponding function. The minimization of duplicate computation is one of the objectives of processor allocation.

A  serious  problem  of  duplicate  computation 
of  nonlinear  functions  and  poor  load  balance 
may  be  caused  by  the  modelling  of  the  power 
supply  of  a  digital  circuit.  Approximately 
half  of  the  transistors  may  be  connected  to 
the  voltage  source  modelling  the  power  sup¬ 
ply  leading  to  a  node  equation  containing  half 
of  the  nonlinear  functions  of  the  total  circuit. 

The  problem  is  not  handled  automatically 
in  the  present  parallel  version  of  ESACAP. 
However, it is easily solved manually by duplicating the power supply voltage source sufficiently many times to reduce the maximum number of functions of the node equations of the voltage sources. The penalty is a larger number of node equations, which is a low penalty in this connection.

5  Sparse  Matrix  Solver 

The  parallel  sparse  matrix  solver  is  based  on 
the  original  sparse  matrix  solver  of  ESACAP. 
Electrical  circuits  will  in  general  lead  to  non- 
symmetric  matrices,  but  since  they  are  close  to 
being  structurally  symmetric,  they  are  treated 
as  such  by  ESACAP.  This  leads  to  simpler  and 
more  efficient  processing  of  the  sparse  matri¬ 
ces.  The  reordering  to  reduce  fill-ins  is  based 
on  the  third  scheme  of  Tinney  and  Walker  [8]. 
It  is  a  symmetric  row  and  column  reordering 
which preserves diagonal elements in the diagonal.

The reordered matrix is distributed to the processors according to a row interleaved scheme. With $p$ processors, rows $k, k+p, k+2p, \ldots$ are allocated to processor $k$ for $k = 1, 2, \ldots, p$. This is a standard scheme which assures as uniform a load distribution as possible during LU-factorization.

The basic LU-factorization of an $N \times N$ matrix $A$ can be outlined as follows,

for i:=1 to N do
  for j:=i+1 to N do
  begin
    eliminate column i
    in row j using row i;
    save pivot element in A[j,i]
  end

Algorithm 1

The algorithm of processor $k$ executing a parallel LU-factorization based on row interleaved processor allocation has the following outline,

for i:=1 to N do
begin
  {if row i is on processor k}
  if i in [k,k+p,k+2p,..] then
    broadcast row i
  else receive row i;
  {row j is the first row on
   processor k where j>i}
  j:=((i+p-k) div p)*p+k;
  while j<=N do
  begin
    eliminate column i
    in row j using row i;
    save pivot element in A[j,i];
    j:=j+p
  end
end

Algorithm 2
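
The row interleaved scheme of Algorithm 2 can be checked with a sequential simulation in which the loop over processors takes the place of concurrent execution and the broadcast is a plain copy of the pivot row. This is a sketch under simplifying assumptions: the matrix is dense, no pivoting is performed, and the names `owner` and `interleaved_lu` are introduced here for illustration only.

```python
import numpy as np


def owner(row, p):
    """Processor (1..p) holding a 1-based row under row interleaving."""
    return (row - 1) % p + 1


def interleaved_lu(a, p):
    """Simulate Algorithm 2 on p virtual processors; return L and U packed in one array."""
    a = np.array(a, dtype=float)
    n = a.shape[0]
    for i in range(1, n + 1):
        pivot_row = a[i - 1].copy()          # "broadcast" by processor owner(i, p)
        for k in range(1, p + 1):            # executed concurrently on a real machine
            j = ((i + p - k) // p) * p + k   # first row on processor k with j > i
            while j <= n:
                m = a[j - 1, i - 1] / pivot_row[i - 1]
                a[j - 1, i:] -= m * pivot_row[i:]   # eliminate column i in row j
                a[j - 1, i - 1] = m                 # keep the multiplier in A[j,i]
                j += p
    return a


rng = np.random.default_rng(0)
n, p = 6, 4
a = rng.random((n, n)) + n * np.eye(n)       # diagonally dominant test matrix
lu = interleaved_lu(a, p)
lower = np.tril(lu, -1) + np.eye(n)
upper = np.triu(lu)
```

Multiplying `lower` by `upper` reproduces the test matrix, which confirms that every row below the pivot is visited exactly once by the processor that owns it.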

The parallel LU-factorization involves both calculation and communication, and the execution time using $p$ processors can be modelled as follows, ignoring terms in lower orders of $N$,

TpLu  =  ■^'y^N^Tp/p  + 

{N  -Ddin+^-^yN/B)  (4) 


The average fraction of nonzero elements of
a row of the sparse matrix is denoted by $\gamma$.
The floating point execution time is denoted
by $T_F$ and the start-up time of communication
by $T_0$. $B$ denotes communication bandwidth
(words/sec) and $d$ is the dimension of the
hypercube parallel computer ($p = 2^d$).

The analogous execution time model for
Algorithm 1 is

$$T_{SLU} = \tfrac{2}{3}\gamma^2 N^3 T_F$$

If the communication term of (4) could be
ignored, the speed-up, $S_{LU} = T_{SLU}/T_{PLU}$,
would be equal to $p$. Unfortunately, this is
rarely the case since $T_0 \approx 24\,T_F$ on the iPSC.
Therefore the opposite situation, where
communication time dominates over computation
time, is more likely to arise, especially when
the matrix is very sparse ($\gamma \ll 1$). In this
situation, speed-up decreases when more
processors are applied ($d$ increases).
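
The trade-off can be made concrete with a small sketch of model (4). It takes the computation term as (2/3) γ² N³ T_F / p and the communication term as (N-1) d (T_0 + γN/(2B)); the numerical coefficients are our reading of the printed formula. The default parameter values are those quoted in Section 7 (T_F = 50 μsec, T_0 = 1.2 msec, B read as 250 words/msec), and the function names are hypothetical.

```python
def t_plu(n, p, d, gamma, t_f, t_0, bandwidth):
    """Model (4): computation term plus broadcast term, in seconds."""
    compute = (2.0 / 3.0) * gamma**2 * n**3 * t_f / p
    communicate = (n - 1) * d * (t_0 + 0.5 * gamma * n / bandwidth)
    return compute + communicate


def t_slu(n, gamma, t_f):
    """Sequential counterpart: the computation term of (4) with p = 1."""
    return (2.0 / 3.0) * gamma**2 * n**3 * t_f


def speed_up(n, d, gamma, t_f=50e-6, t_0=1.2e-3, bandwidth=250e3):
    """Speed-up S_LU = T_SLU / T_PLU on a hypercube of dimension d."""
    p = 2**d
    return t_slu(n, gamma, t_f) / t_plu(n, p, d, gamma, t_f, t_0, bandwidth)


for d in (3, 4, 5):
    # processors, speed-up for a moderately sparse and a very sparse matrix
    print(2**d, round(speed_up(180, d, 0.2), 2), round(speed_up(180, d, 0.05), 2))
```

With the start-up time set to zero and an effectively unlimited bandwidth the function returns p, which is the ideal case mentioned in the text.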

When  the  coefficient  matrix  of  a  system  of 
linear  equations  is  factored  into  an  LU  prod¬ 
uct, 

LUx  =  b 

the solution is computed by a forward
substitution, $y = L^{-1}b$, followed by a backward
substitution, $x = U^{-1}y$. A parallel version of
the forward substitution is outlined in the
following algorithm running on processor k. L is
stored in the lower triangular part of A.

for i:=1 to N do
begin
  {if row i is on processor k ..}
  if i in [k,k+p,k+2p,..] then
  begin
    y[i]:=b[i];
    broadcast y[i]
  end
  else receive y[i];

  {row j is the first row on
   processor k where j>i}
  j:=((i+p-k) div p)*p+k;
  while j<=N do
  begin
    b[j]:=b[j]-L[j,i]*y[i];
    j:=j+p
  end
end

Algorithm 3

The backward substitution is very similar
and  will  not  be  shown.  The  execution  time 
model  of  the  complete  parallel  solution  algo¬ 
rithm  is  as  follows, 

$$T_{PS} = [\ldots] + 2(N-1)\, d\, (T_0 + 1/B) \qquad (5)$$

The parallel solution algorithm is dominated
by communication cost to an even larger extent
than the LU-factorization. Because the number
of broadcasts of the solution algorithm is
twice the number of broadcasts of the
factorization algorithm, the execution time of the
solution algorithm may exceed the execution
time of the factorization algorithm. This is
most unsatisfactory since the complexity of
solution is $O(N^2)$ while the complexity of
factorization is $O(N^3)$.

A significant improvement is obtained by
including forward substitution, Algorithm 3, in
the LU-factorization, Algorithm 2. This is
actually the classical Gaussian elimination, and
the improvement is obtained by including the
y[i] values in the pivot rows which are broadcast.
This way, half of the communications
of the solution algorithm are saved. The full
advantage  of  this  is  gained  in  the  true  New¬ 
ton  iteration  version  of  (3).  The  advantage  is 
less  for  the  pseudo  Newton  iteration  where  the 
Gaussian  elimination  is  only  used  in  the  first 
iteration  where  the  Jacobian  is  computed  and 
LU-factorized. 

The execution time model (4) includes the
factor 1/p to reflect the parallel work
performed by p processors. Each processor is
responsible for the elimination of N/p rows,
but if the number of non-zero elements of a
column is less than the number of processors
($\gamma N < p$), some processors will be idle.
Therefore the effective number of processors may be
less than p. This phenomenon, together with
the relatively high cost of the start-up of a
communication, $T_0$, motivates a modification
of Algorithm 2 into a block version.

The processor allocation of the rows is
changed such that blocks of b consecutive rows
are allocated in an interleaved scheme.
Processor k will therefore hold rows (k-1)b+1,
(k-1)b+2, ..., kb, (k-1+p)b+1, (k-1+p)b+2,
..., (k+p)b, ... Algorithm 2 is modified to
receive and broadcast, respectively, blocks of b
rows instead of single rows. The number of
communications is then reduced by a factor of
b, and the amount of work between
communications is increased by a factor of b.
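
As an illustration of the block row-interleaved allocation, the following sketch maps a row number to its processor. The helper names `owner` and `rows_on` are ours; rows and processors are numbered from 1 as in the text.

```python
def owner(row, b, p):
    """Processor (1-based) holding `row` (1-based) with blocks of b rows."""
    return ((row - 1) // b) % p + 1


def rows_on(k, b, p, n):
    """All rows of an n-row matrix held by processor k."""
    return [row for row in range(1, n + 1) if owner(row, b, p) == k]


# Blocks of b = 2 rows on p = 3 processors, 16 rows in total.
print(rows_on(1, 2, 3, 16))   # rows 1-2, 7-8 and 13-14
# With b = 1 the scheme reduces to plain row interleaving.
print(rows_on(2, 1, 4, 10))   # rows 2, 6 and 10
```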

The general step of the block LU-factorization
on processor k can be described
informally as follows. Receive a block of b
pivot rows and perform elimination. The total
time for this step is $T_B$. If processor k is
going to supply the next block of pivot rows,
the pivot rows are first applied to this block
(total time $T_S$). Then elimination within the
block of pivot rows is performed (total time
$T_T$), and the block of pivot rows is broadcast
(total time $T_C$). Finally, the elimination
with the block of pivot rows last received is
completed.

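The execution time models of the elimination
steps are as follows.

$$T_B = \tfrac{2}{3}\gamma^2 N^3 T_F / p \qquad (6)$$

$$T_S = b\,\gamma^2 N^2 T_F \qquad (7)$$

$$T_T = \tfrac{1}{2}(b-1)\,\gamma^2 N^2 T_F \qquad (8)$$

$$T_C = d\,T_0\,(N/b - 1) + \tfrac{1}{2}\,d\,\gamma N^2 / B \qquad (9)$$

The resulting execution time of the block version
of the parallel LU-factorization algorithm
is therefore,

$$T_{BLU} = T_B + T_S + T_T + T_C \qquad (10)$$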

The  limitations  of  this  rather  crude  model 
are  discussed  in  Section  7. 
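
A short sketch of the block model may help when experimenting with the block size b. It assumes the reading T_B = (2/3) γ² N³ T_F / p, T_S = b γ² N² T_F, T_T = (1/2)(b-1) γ² N² T_F and T_C = d T_0 (N/b - 1) + d γ N² / (2B) for formulas (6) - (9); the coefficients are inferred from the printed formulas and from the parameter values of Section 7, so they should be treated as assumptions. Times are in milliseconds, and the larger floating point time used for (7) and (8) is passed separately as the hypothetical argument `t_f_block`.

```python
def block_lu_model(n, p, d, b, gamma,
                   t_f=0.05, t_f_block=0.075, t_0=1.2, bandwidth=250.0):
    """Return (T_B, T_S, T_T, T_C, T_BLU) in msec for block size b >= 2."""
    t_b = (2.0 / 3.0) * gamma**2 * n**3 * t_f / p           # (6)
    t_s = b * gamma**2 * n**2 * t_f_block                    # (7)
    t_t = 0.5 * (b - 1) * gamma**2 * n**2 * t_f_block        # (8)
    t_c = d * t_0 * (n / b - 1) + 0.5 * d * gamma * n**2 / bandwidth  # (9)
    return t_b, t_s, t_t, t_c, t_b + t_s + t_t + t_c         # (10)


for b in (2, 3, 4, 5):
    terms = block_lu_model(180, 8, 3, b, 0.19)
    print(b, [round(t) for t in terms])
```

The communication term shrinks with b while T_S and T_T grow linearly in b, so the model predicts a shallow optimum at a small block size.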

The execution time of the solution algorithm,
$T_{PS}$ given in (5), is composed of a
computation and a communication term. The
communication term is divided by the number
of rows in a block, b, when blocking is
introduced, and the computation term is essentially
unchanged.

However,  the  blocking  introduces  a  substan¬ 
tial  amount  of  overhead  in  terms  of  buffer  ad¬ 
ministration,  index  computation  etc.  which 
reduces  the  gain  of  blocking.  The  overhead  is 
not  related  to  the  floating  point  operations  of 
the  solution  algorithm  in  a  simple  way,  and  a 
detailed  modelling  is  not  attempted. 

6  Processor  Allocation 

The allocation of node equations (matrix
rows) to processors was discussed in the
previous section. The basic principle is the
row-interleaved scheme where runs of p consecutive
node equations (rows) are allocated to p
different processors, p being the number of
available processors. This principle is modified to
the block row-interleaved scheme where b
consecutive rows are lumped together and treated
as one in the allocation scheme.

The purpose of row interleaving is to balance
the load on the processors during the
LU-factorization and solution. In this respect it is
important that the rows of a run of p consecutive
rows are allocated to p different processors.
However, it is not important which processor
receives which row. In other words, any two rows
in a run of p rows are allocated to two different
processors, and they
can be interchanged freely. This freedom in
allocating  node  equations  (rows)  within  a  run 
is  exploited  to  reduce  the  number  of  duplicate 
function  calculations  mentioned  in  Section  4 
and  to  improve  the  load  balance  during  com¬ 
putation  of  nonlinear  functions. 

Based on the circuit description input to
ESACAP, an N x N array M is constructed
where M[i,j] contains the number of nonlinear
two-terminal elements between node i and
node j. The matrix is symmetric and the
diagonal entry M[i,i] contains the number of
two-terminal elements from node i to ground. If the
node equations i and j are both on processor k,
the nonlinear two-terminal elements between
nodes i and j only have to be evaluated once,
i.e. only processor k has to evaluate these
M[i,j] functions. If node equations i and j are
on processors k and l, respectively, both these
processors must compute the M[i,j] functions.

Let r denote the number of rows on a
processor (r = N/p), and let the array Rk contain
the row numbers for the rows on processor k. The
number of functions, Fk, to be computed by
processor k, can be evaluated by the following
algorithm,

Fk:=0;
for i:=1 to r do
begin
  {add gross number of functions}
  for j:=1 to N do
    Fk:=Fk+M[Rk[i],j];
  {subtract number of functions
   counted twice}
  for j:=1 to i-1 do
    Fk:=Fk-M[Rk[i],Rk[j]]
end

Algorithm  4 
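
The counting rule can be tried on a toy circuit. The sketch below restates Algorithm 4 with 0-based indices; the 4-node matrix is a made-up example (a chain of nodes 0-1-2-3 with one element from node 0 to ground), not data from the multiplier model, and the function name is ours.

```python
def functions_on_processor(m, rows):
    """Number of nonlinear functions a processor holding `rows` must evaluate.

    m[i][j] counts the nonlinear two-terminal elements between nodes i and j
    (m[i][i]: elements from node i to ground); indices are 0-based here.
    """
    total = 0
    for a, row in enumerate(rows):
        total += sum(m[row])                                # gross count
        total -= sum(m[row][other] for other in rows[:a])   # counted twice
    return total


# Hypothetical example: chain 0-1-2-3, one element from node 0 to ground.
m = [[1, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
print(functions_on_processor(m, [0, 1]), functions_on_processor(m, [0, 2]))
# prints: 3 4
```

Holding the neighbouring rows 0 and 1 costs three evaluations, whereas the non-adjacent rows 0 and 2 cost four, which is the kind of difference the reallocation exploits.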


The  initial  processor  allocation  is  recorded 
and  the  following  Monte  Carlo  algorithm  is 
used  to  improve  the  allocation  by  interchang¬ 
ing  rows  within  a  run  of  p  consecutive  rows, 

for run:=1 to r do
  for it:=1 to maxit do
  begin
    improvement:=true;
    while improvement do
    begin
      choose randomly two different
      rows, s and t, from run;
      {rows s and t are on processors
       ps and pt, respectively}
      evaluate the effect on Fps and
      Fpt of interchanging them;
      {Algorithm 4}
      improvement:=
        {reduced work load}
        ((Fps is not increased) and
         (Fpt is not increased)) or
        {improved load balance}
        ((Fps decreases) and Fpt<=Fps)
        or
        ((Fpt decreases) and Fps<=Fpt);
      if improvement then
        interchange rows s and t
    end
  end

Algorithm  5 

Algorithm 5 is repeated until no more
improvements are obtained. When several
consecutive rows are allocated to a processor as a
block they are treated as one long row by the
algorithm. The first version of the processor
allocation algorithm was based on simulated
annealing, but it turned out to be much more
expensive and only marginally better than the
simple Monte Carlo algorithm.
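
A minimal sketch of the interchange heuristic follows. It departs from Algorithm 5 in two labeled ways: a fixed number of random trials per run replaces the `while improvement` loop, and the load-balance test compares the function counts after the tentative interchange, which is our interpretation of the conditions. The matrix is random test data and all names are hypothetical.

```python
import random


def count_functions(m, rows):
    """Algorithm 4 for one processor (0-based node indices)."""
    total = 0
    for a, row in enumerate(rows):
        total += sum(m[row]) - sum(m[row][other] for other in rows[:a])
    return total


def loads(m, alloc, p):
    """Function count of every processor for the allocation row -> processor."""
    return [count_functions(m, [r for r, k in enumerate(alloc) if k == q])
            for q in range(p)]


def reallocate(m, p, trials=200, seed=0):
    """Swap rows inside each run of p consecutive rows when the swap helps."""
    rng = random.Random(seed)
    n = len(m)
    alloc = [row % p for row in range(n)]       # row-interleaved start
    for start in range(0, n - p + 1, p):        # complete runs only
        for _ in range(trials):
            s, t = rng.sample(range(start, start + p), 2)
            ps, pt = alloc[s], alloc[t]
            old = loads(m, alloc, p)
            alloc[s], alloc[t] = pt, ps         # tentative interchange
            new = loads(m, alloc, p)
            less_work = new[ps] <= old[ps] and new[pt] <= old[pt]
            better_balance = ((new[ps] < old[ps] and new[pt] <= new[ps]) or
                              (new[pt] < old[pt] and new[ps] <= new[pt]))
            if not (less_work or better_balance):
                alloc[s], alloc[t] = ps, pt     # undo the interchange
    return alloc


rng = random.Random(1)
n, p = 12, 4
m = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i, n):
        m[i][j] = m[j][i] = rng.randint(0, 3)   # symmetric element counts
before = loads(m, [row % p for row in range(n)], p)
alloc = reallocate(m, p)
after = loads(m, alloc, p)
print(max(before), max(after))
```

Because an interchange only changes the counts of the two processors involved, and is accepted only when neither of them ends up above the larger of the two previous counts, the maximum load never increases.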

Tables 1-3 show the results obtained by the
Monte Carlo reallocation algorithm applied to
the model of the 4 x 4 bit multiplier. The
total number of nonlinear functions of the
180 node equations modelling the multiplier is
1118. When the node equations are allocated
to 32 processors, approximately 2000 functions
must be computed because of the need for
duplicate computation.

functions      initial   reallocated
max F            94         66
F (average)      63.2       61.7
σ                14.4        3.0

Table 1: Nonlinear functions per processor.
Rows per block b = 1.

functions      initial   reallocated
max F            98         71
F (average)      62.5       61.3
σ                17.1        6.7

Table 2: Nonlinear functions per processor.
Rows per block b = 2.

functions      initial   reallocated
max F            92         78
F (average)      61.4       61.0
σ                16.6       11.3

Table 3: Nonlinear functions per processor.
Rows per block b = 3.

The main effect of the Monte Carlo algorithm
is to reduce the maximum number of
functions allocated to one processor (max F)
and thus improve load balance. This is also
clearly reflected by the standard deviation σ
of the number of functions, while the average
number of functions F to be computed by a
processor is reduced only slightly.

With an increased number of rows in a block,
the freedom to reallocate is reduced, and thus
the efficiency of the Monte Carlo algorithm.
This is reflected by the maximum number of
functions and by the standard deviation.

7  Results 

The performance of the parallel implementation
of ESACAP is evaluated using a 4 x 4 bit
version of the multiplier described in Section
2. The input of the multiplier is a sequence of
binary numbers, 0000 x 0000, 0010 x 0010,
1100 x 1100, 1011 x 1011, 0100 x 0100 and
1111 x 1111. The duration of each digit is 20
nsec and the transition from one level to
another takes 10 nsec. The total simulated time
is 170 nsec.


 p   b        T      F   F1/Fp   T1/Tp
 1   -  133,200   1118     -       -
 8   3   29,870    231    4.84    4.46
16   3   17,892    129    8.67    7.44
32   2   12,124     71   15.75   11.0

Table 4: Execution time and speed-up figures
for parallel ESACAP

The main performance figures are given in
Table 4 which lists the total execution time
(T) in seconds in column 3 and the speed-up
over one processor ($T_1/T_p$) in column 6. The
Cube Manager and the sequential ESACAP
were used for p = 1 because one node processor
with only 512KB memory is too small to hold
the problem. At least 8 processors are required
to simulate the multiplier, which explains why
Table 4 does not have entries for 2 ≤ p ≤ 4.

The number of rows in a block (b) is chosen
to minimize the execution time. F denotes
the maximum number of nonlinear functions
to be computed in any processor, and $F_1/F_p$ is
the ratio of nonlinear functions in the sequential
version over the maximum number of nonlinear
functions on one processor in the parallel
version. This ratio is seen to be strongly
correlated with the speed-up, $T_1/T_p$. However, the
discrepancy increases with increasing value of
p since the solution of linear equations, which
does not parallelize well, then becomes
relatively more important (see Table 5).
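
The two ratio columns of Table 4 follow directly from the T and F columns. The short sketch below recomputes them from the tabulated values and also prints the parallel efficiency (speed-up divided by p), a derived quantity that is not part of the table.

```python
# Rows of Table 4: (p, T in seconds, F = max functions on one processor).
table4 = [(1, 133200, 1118), (8, 29870, 231), (16, 17892, 129), (32, 12124, 71)]

_, t1, f1 = table4[0]
ratios = []
for p, t, f in table4[1:]:
    function_ratio = f1 / f   # F1/Fp: bound set by the function evaluation
    speed_up = t1 / t         # T1/Tp: measured speed-up
    ratios.append((p, function_ratio, speed_up))
    print(p, round(function_ratio, 2), round(speed_up, 2), round(speed_up / p, 2))
```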

The speed-up obtained for fewer than 32
processors is satisfactory. For 32 processors the
problem, which has 180 equations, is too small
to maintain a speed-up value of approximately
p/2.

Table 5 shows a breakdown of the execution
times for the parallel simulations given
in Table 4. $T_{F+J}$ denotes the time spent
computing the nonlinear functions and the
Jacobian defined in connection with (3), i.e. the
nonlinear functions of the node equations and the
corresponding derivatives. $T_{LU}$ and $T_S$ denote
the time spent doing LU-factorization and
solution, respectively. The execution times of
Table 5 do not quite add up to the execution
times of Table 4 because the latter include
some additional overhead.

 p   b   T_F+J    T_LU     T_S
 8   1   26,912   1,594   2,140
 8   3   26,854   1,545   1,210
16   1   14,579   1,451   2,568
16   3   14,752   1,488   1,350
32   1    8,316   1,480   3,015
32   2    8,432   1,389   1,841

Table 5: Break down of execution time figures
for parallel ESACAP

 p   b    γ     T_B     T_C    T_S   T_T   T_BLU       T
 8   1   0.19   866     644     -     -    1,511    1,511
 8   2    -     866     322    175    45     ?      1,394
 8   3    -     866     215    264    90   1,435    1,415
 8   4    -     866     161    351   135   1,513    1,574
 8   5    -     866     130    440   175     ?      1,745
16   1   0.20   492     859     -     -    1,351      ?
16   2    -     492      ?     195    48     ?      1,276
16   3    -     492      ?     293    96     ?      1,361
32   1   0.23   316   1,073     -     -      ?        ?
32   2    -     316     536    257    64     ?        ?

Table 6: Execution time break down of one LU-factorization given in msec. The times T_B - T_BLU
refer to formulas (6) - (10). T is measured execution time. (? = entry illegible in the source.)

The influence of blocking rows is displayed
in Table 5 which includes execution times for
b = 1 and for block sizes giving minimum
execution times. The blocking reduces the
freedom in the processor allocation algorithm,
and this leads to a slight increase in $T_{F+J}$ for
p = 16 and p = 32 (cf. Tables 1-3).

The LU-factorization only benefits from
blocking for p = 32, and this phenomenon is
probably due to better processor utilization.
With 180 node equations, the average number
of rows per processor is less than 6, and with
an average density of the Jacobi matrix of 0.1
(different from $\gamma$, which is a model parameter),
several processors will be idle during an
elimination stage if pivot rows are broadcast one
by one.

The solution algorithm gains most from
blocking although the expected reduction to
1/b is not quite obtained.

Table 6 shows execution times based on the
model, formulas (6) - (10). The times are in
milliseconds and refer to one LU-factorization.
The parameter $\gamma$, which is an average row
density used in the execution time model, is
estimated for b = 1. Therefore $T_{BLU} = T$ for
b = 1. The increase in $\gamma$ with increasing p
reflects the decrease in processor utilization.
The remaining parameters of the model are as
follows: N = 180, $T_F$ = 50 μsec, $T_0$ = 1.2 msec
and B = 250 words/msec. Because of the
substantial overhead involved in the operations
modelled by (7) and (8), an increased value
of the floating point execution time is used
there, $T_F$ = 75 μsec.

The execution time model is quite accurate
for p = 8 and b < 4. For larger values of
b, the blocking leads to poor load distribution
which is not modelled. This probably also
accounts for the less accurate values for p = 16
and p = 32. However, the model still explains
satisfactorily why the execution time of
LU-factorization is not reduced by the full amount
of saving in communication ($T_C$) when
blocking is introduced.

8  Conclusion 

The  DC  and  transient  analysis  part  of  the 
general  circuit  analysis  program  ESACAP  was 
parallelized  for  the  Intel  iPSC  with  a  modest 
effort  relative  to  the  size  and  complexity  of  the 
original  sequential  version. 

A  speed-up  of  the  parallel  version  over  a  se¬ 
quential  version  of  approximately  p/2  can  be 
expected  when  certain  conditions  are  fulfilled: 
the  problem  must  be  highly  nonlinear  (e.  g. 
a  digital  electrical  circuit)  and  the  amount  of 
work  per  processor  must  be  adequate. 

The present parallel LU-factorization is
straightforward and not very efficient. This
part leaves room for substantial improvement,
and the parallel version of ESACAP will be
used in the future as a test vehicle in connection
with research in sparse matrix parallel
algorithms.

Abstract

The accurate, high-speed solution of systems of
ordinary differential-algebraic equations (DAE’s) of low
index is of great importance in chemical, electrical and
other engineering disciplines. Petzold’s Fortran-based
DASSL is the most widely used sequential code for
solving DAE’s. We have devised and implemented a
completely new C code, Concurrent DASSL,
specifically for multicomputers and patterned on DASSL. In
this work, we address the issues of data distribution
and the performance of the overall algorithm, rather
than just that of individual steps. Concurrent DASSL
is designed as an open, application-independent
environment below which linear algebra algorithms may be
added in addition to standard support for dense and
sparse algorithms. The user may furthermore attach
explicit data interconversions between the main
computational steps, or choose compromise distributions.
A “problem formulator” (simulation layer) must be
constructed above Concurrent DASSL, for any specific
problem domain. We indicate performance for a
particular chemical engineering application, a sequence of
coupled distillation columns. Future efforts are cited
in conclusion.

Introduction 

In this paper, we discuss the design of a
general-purpose integration system for ordinary
differential-algebraic equations of low index, following up on
our more preliminary discussion in [16]. The new
solver, Concurrent DASSL, is a parallel, C-language
implementation of the algorithm codified in Petzold’s
DASSL, a widely used Fortran-based solver for DAE’s
[11,4], and based on a loosely synchronous model of
communicating sequential processes [9]. Concurrent
DASSL retains the same numerical properties as the
sequential algorithm, but introduces important new
degrees of freedom compared to it. We identify the
main computational steps in the integration process;
for each of these steps, we specify algorithms whose
correctness is independent of data distribution.

We cover the computational aspects of the major
computational steps, and their data distribution
preferences for highest performance. We indicate the
properties of the concurrent sparse linear algebra as
it relates to the rest of the calculation. We
describe the proto-Cdyn simulation layer, a
distillation-simulation-oriented Concurrent DASSL driver which,
despite specificity, exposes important requirements for
concurrent solution of ordinary DAE’s; the ideas
behind a template formulation for simulation are, for
example, expressed.

We indicate formulation issues and specific features of
the chemical engineering problem, dynamic distillation
simulation. We indicate results for an example
in this area, which demonstrates the feasibility of this
method, but also the need for additional future work, both
on the sparse linear algebra, and on modifying the
DASSL algorithm to reveal more concurrency, thereby
amortizing the cost of linear algebra over more time
steps in the algorithm.

Mathematical  Formulation 

We address the following initial-value problem
consisting of combinations of $N$ linear and nonlinear coupled,
ordinary differential-algebraic equations over the
interval $t \in [T_0, T_1]$:

IVP($F, u, Z_0, [T_0, T_1]; N, P$):

$$F(Z, \dot{Z}, u; t) = 0, \quad t \in [T_0, T_1], \qquad (1)$$

$$Z(t = T_0) = Z_0, \quad \dot{Z}(t = T_0) = \dot{Z}_0,$$

with unknown state vector $Z(t) \in \Re^N$, known
external inputs $u(t) \in \Re^P$, where $F(\cdot\,; t) \mapsto \Re^N$ and
$Z_0, \dot{Z}_0 \in \Re^N$ are the given initial-value, derivative
vectors, respectively. We will refer to Equation 1’s
deviation from 0 as the residuals or residual vector.
Evaluating the residuals means computing $F(Z, \dot{Z}, u; t)$
(“model evaluation”) for specified arguments $Z$, $\dot{Z}$, $u$
and $t$.

DASSL’s integration algorithm can be used to solve
systems fully implicit in $Z$ and $\dot{Z}$ and of index zero or
one, and specially structured forms of index two (and
higher) [4, Chapter 5], where the index is the minimum
number of times that part or all of Equation 1 must
be differentiated with respect to $t$ in order to express
$\dot{Z}$ as a continuous function of $Z$ and $t$ [4, page 17].

By substituting a finite-difference approximation $\mathcal{D}_i Z$
for $\dot{Z}$, we obtain:

$$F_D(Z_i; \tau_i) = F(Z_i, \mathcal{D}_i Z_i, u_i; t = \tau_i) = 0, \qquad (2)$$

a set of (in general) nonlinear staticized equations. A
sequence of Equation 2’s will have to be solved, one
at each discrete time $t = \tau_i$, $i = 1, 2, \ldots, M$¹, in the
numerical approximation scheme; neither $M$ nor the
$\tau_i$’s need be pre-determined. In DASSL, the variable
step-size integration algorithm picks the $\tau_i$’s as the
integration progresses, based on its assessment of the
local error. The discretization operator for $\dot{Z}$, $\mathcal{D}$, varies
during the numerical integration process and hence is
subscripted as $\mathcal{D}_i$.

The usual way to solve an instance of the staticized
equations, Equation 2, is via the familiar
Newton-Raphson iterative method (yielding $Z_i = Z_i^{\infty}$):

$$Z_i^{k+1} = Z_i^k - c\,\{\nabla_Z F_D(Z_i^{m_k}; \tau_i)\}^{-1} F_D(Z_i^k; \tau_i), \quad k = 0, 1, \ldots \qquad (3)$$

given an initial, sufficiently good approximation $Z_i^0$.
The classical method is recovered for $m_k = k$ and
$c = 1$, whereas a modified (damped) Newton-Raphson
method results for $m_k < k$ (respectively, $c < 1$).
In the original DASSL algorithm and in Concurrent
DASSL, the Jacobian $\nabla_Z F_D(Z)$ is computed by
finite differences rather than analytically; this departure
leads in another sense to a modified Newton-Raphson
method even though $m_k = k$ and $c = 1$ might
always be satisfied. For termination, a limit $k < k^*$
is imposed; a further stopping criterion of the form
$\|Z_i^{k+1} - Z_i^k\| < \epsilon$ is also incorporated (see Brenan et
al. [4, pages 121-124]).

¹ And more at trial timepoints which are discarded by the
integration algorithm.
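
To make the roles of the Jacobian-reuse index and the damping factor concrete, here is a minimal scalar sketch of the iteration. It is not taken from DASSL: the function name, the `refresh` argument (the derivative is re-evaluated every `refresh` iterations and reused in between) and the test equation z² - 2 = 0 are illustrative assumptions.

```python
def modified_newton(residual, jacobian, z0, refresh=1, damping=1.0,
                    tol=1e-12, max_iter=50):
    """Scalar Newton-Raphson with a derivative that may be reused."""
    z = z0
    slope = None
    for k in range(max_iter):
        if k % refresh == 0:
            slope = jacobian(z)      # fresh derivative; stale in between
        z_next = z - damping * residual(z) / slope
        if abs(z_next - z) < tol:    # stopping criterion on the update
            return z_next, k + 1
        z = z_next
    raise RuntimeError("no convergence within max_iter iterations")


def f(z):
    return z * z - 2.0


def df(z):
    return 2.0 * z


root_classic, its_classic = modified_newton(f, df, 1.5)
root_frozen, its_frozen = modified_newton(f, df, 1.5, refresh=5)
print(root_classic, its_classic, root_frozen, its_frozen)
```

Reusing the derivative trades a few extra iterations for fewer derivative evaluations, which is the same trade that makes stored LU factors attractive in the full algorithm.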

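Following Brenan et al., the approximation $\mathcal{D}_i Z$ is
replaced by a BDF-generated linear approximation,
$\alpha Z + \beta$, and the Jacobian

$$\nabla_Z F(Z, \alpha Z + \beta, u; t) = \frac{\partial F}{\partial Z} + \alpha \frac{\partial F}{\partial \dot{Z}}. \qquad (4)$$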
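
The simplest instance of such a linear approximation is the first-order BDF (backward Euler), for which the derivative at the new time point is (Z_i - Z_{i-1})/h, so that α = 1/h and β = -Z_{i-1}/h. The scalar sketch below uses that formula and the Jacobian combination of Equation 4 inside a Newton loop; the function and argument names are ours, and the test problem dz/dt + z = 0 is chosen only because its backward Euler solution is known in closed form.

```python
def backward_euler_step(residual, dF_dz, dF_dzdot, z_prev, t_next, h,
                        newton_steps=5):
    """One first-order BDF step for a scalar problem F(z, zdot, t) = 0."""
    alpha = 1.0 / h              # zdot is approximated by alpha*z + beta
    beta = -z_prev / h
    z = z_prev                   # previous value as the initial guess
    for _ in range(newton_steps):
        zdot = alpha * z + beta
        # Equation 4: dF/dZ + alpha * dF/dZdot
        jac = dF_dz(z, zdot, t_next) + alpha * dF_dzdot(z, zdot, t_next)
        z -= residual(z, zdot, t_next) / jac
    return z


# Test problem: F = zdot + z = 0 with z(0) = 1.
h, z, t = 0.1, 1.0, 0.0
for _ in range(10):
    t += h
    z = backward_euler_step(lambda z, zd, t: zd + z,
                            lambda z, zd, t: 1.0,
                            lambda z, zd, t: 1.0,
                            z, t, h)
print(z, (1.0 / (1.0 + h)) ** 10)
```

For this linear problem a single Newton step is already exact, and each step multiplies the solution by 1/(1 + h).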

From this approximation, we define $F_{\alpha,\beta}(Z; \tau_i)$ in the
intuitive way. We then consider Taylor’s Theorem with
remainder, from which we can easily express a forward
finite-difference approximation for each Jacobian
column (assuming sufficient smoothness of $F_{\alpha,\beta}$) with a
scaled difference of two residual vectors:

$$F_{\alpha,\beta}(Z + \delta_j; \tau_i) - F_{\alpha,\beta}(Z; \tau_i) = \{\nabla_Z F_{\alpha,\beta}(Z; \tau_i)\}\,\delta_j + O(\|\delta_j\|^2) \qquad (5)$$

By picking $\delta_j$ proportional to $e_j$, the $j$th unit vector in
the natural basis for $\Re^N$, namely $\delta_j = d_j e_j$, Equation 5
yields a first-order-accurate approximation in $d_j$ of the
$j$th column of the Jacobian matrix:

$$\frac{F_{\alpha,\beta}(Z + \delta_j; \tau_i) - F_{\alpha,\beta}(Z; \tau_i)}{d_j} = \{\nabla_Z F_{\alpha,\beta}(Z; \tau_i)\}\,e_j + O(d_j), \quad j = 1, \ldots, N \qquad (6)$$

Each of these $N$ Jacobian-column computations is
independent and trivially parallelizable. It’s well known,
however, that for special structures such as banded and
block n-diagonal matrices, and even for general sparse
matrices, a single residual can be used to generate
multiple Jacobian columns [4,8]. We discuss these issues
as part of the concurrent formulation section below.
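
The column-by-column construction of Equation 6 can be sketched in a few lines. The example below is a plain dense version with one extra residual evaluation per column; it does not attempt the column-grouping trick for banded or sparse matrices, and the function name and the two-equation test residual are illustrative only.

```python
def fd_jacobian(residual, z, steps):
    """Forward-difference Jacobian, one column per perturbed unknown."""
    n = len(z)
    base = residual(z)
    columns = []
    for j in range(n):                 # each column is independent of the others
        shifted = list(z)
        shifted[j] += steps[j]         # Z + d_j * e_j
        bumped = residual(shifted)
        columns.append([(bumped[i] - base[i]) / steps[j] for i in range(n)])
    # columns[j][i] approximates dF_i/dZ_j; return the matrix row by row
    return [[columns[j][i] for j in range(n)] for i in range(n)]


def example_residual(z):
    """Hypothetical residual with Jacobian [[2*z0, 1], [-1, 3]]."""
    return [z[0] ** 2 + z[1], 3.0 * z[1] - z[0]]


jac = fd_jacobian(example_residual, [1.0, 2.0], [1e-6, 1e-6])
for row in jac:
    print([round(entry, 4) for entry in row])
```

In a concurrent setting the loop over j is the natural unit of distribution, since every column needs only the shared base residual and its own perturbed evaluation.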

The solution of the Jacobian linear system of
equations is required for each $k$-iteration, either through
a direct (e.g., LU-factorization) or iterative (e.g.,
preconditioned-conjugate-gradient) method. The
most advantageous solution approach depends on $N$ as
well as special mathematical properties and/or
structure of the Jacobian matrix. Together, the
inner (linear equation solution) and outer
(Newton-Raphson iteration) loops solve a single time point;
the overall algorithm generates a sequence of solution
points $Z_i$, $i = 0, 1, \ldots, M$.

In the present work, we restrict our attention to
direct, sparse linear algebra as described in [13],
although future versions of Concurrent DASSL will
support the iterative linear algebra approaches by Ashby,
Lee, Brown, Hindmarsh et al. [3,5]. For the sparse
LU factorization, the factors are stored and reused in
the modified Newton scenario. Then, repeated use of
the old Jacobian implies just a forward and back-solve
step using the triangular factors L and U. Practically,
we can use the Jacobian for up to about five steps [4].
The useful lifetime of a single Jacobian evidently
depends somewhat strongly on details of the integration
procedure [4].

proto-Cdyn  -  Simulation  Layer 

To  use  the  Concurrent  DASSL  system  on  other  than 
toy  problems,  a  simulation  layer  must  be  constructed 
above  it.  The  purpose  of  this  layer  is  to  accept  a 
problem  specification  from  within  a  specific  problem 
domain,  and  formulate  that  specification  for  concur¬ 
rent  solution  as  a  set  of  differential-algebraic  equa¬ 
tions,  including  any  needed  data.  On  one  hand,  such 
a  layer  could  explicitly  construct  the  subset  of  equa¬ 
tions  needed  for  each  processor,  generate  the  appro¬ 
priate  code  representing  the  residual  functions,  and 
create  a  set  of  node  programs  for  effecting  the  sim¬ 
ulation.  This  is  the  most  flexible  approach,  allowing 
the  user  to  specify  arbitrary  nonlinear  DAE’s.  It  has 
the  disadvantage  of  requiring  a  lot  of  compiling  and 
linking  for  each  run  in  which  the  problem  is  changed 
in  any  significant  respect  (including  but  not  limited 
to  data  distribution),  although  with  sophisticated  tac¬ 
tics,  parametric  variations  within  equations  could  be 
permitted  without  re-compiling  from  scratch,  and  in¬ 
cremental  linking  could  be  supported. 

We  utilize  a  template-based  approach  here,  as  we  do 
in  the  Waveform-Relaxation  paradigm  for  concurrent 
dynamic  simulation  [15].  This  is  akin  to  the  ASCEND 
II methodology utilized by Kuru and many others [10].
It  is  a  compromise  approach  from  the  perspective  of 
flexibility;  interesting  physical  prototype  subsystems 
are  encapsulated  into  compiled  code  as  templates.  A 
template  is  a  conceptual  building  block  with  states, 
non-states,  parameters,  inputs  and  outputs  (see  be¬ 
low).  A  general  network  made  from  instantiations 
of  templates  can  be  constructed  at  runtime  without 
changing any executable code. User input specifies the
number and type of each template, their interconnection
pattern, and the initial value of systemic states
and extraneous (non-state) variables, plus the value of
adjustable parameters and more elaborate data, such
as physical properties. The addition of templates
requires new subroutines for the evaluation of the
residuals of their associated DAE’s, and also for
interfacing to the remainder of the system (e.g., parsing of
user input, interconnectivity issues). With suitable
automated tools, this addition process can be made
straightforward to the user.

Importantly, the use of a template-based methodology
does  not  imply  a  degradation  in  the  numerical  qual¬ 
ity  of  the  model  equations  or  solution  method  used. 
We  are  not  obliged  to  tear  equations  based  on  tem¬ 
plates  or  groups  of  templates  as  is  done  in  sequential- 
modular  simulators  [19,6],  where  “sequential”  refers 
in  this  sense  to  the  stepwise  updating  of  equation  sub¬ 
sets,  without  connection  to  the  number  of  computers 
assigned  to  the  problem  solution. 

Ideally,  the  simulation  layer  could  be  made  universal. 
That  is,  a  generic  layer  of  high  flexibility  and  structural 
elegance  would  be  created  once  and  for  all  (and  with¬ 
out  predilection  for  a  specific  computational  engine). 
Thereafter,  appropriate  templates  would  be  added  to 
articulate  the  simulator  for  a  given  problem  domain. 
This  is  certainly  possible  with  high-quality  simulators 
such  as  ASCEND  II  and  Chemsim  (a  recent  Fortran- 
based  simulator  driving  DASSL  and  MA28  [2,11,7]). 
Even  so,  we  have  chosen  to  restrict  our  efforts  to 
a  more  modest  simulation  layer,  called  proto-Cdyn, 
which can create arbitrary networks of coupled distillation
columns. Even this restricted scope has required significant
effort, and it already allows us to explore many
of the important issues of concurrent dynamic simulation.
General-purpose simulators are for future consideration.
They must address significant questions of
user-interface  in  addition  to  concurrency-formulation 
issues. 

In  the  next  paragraphs,  we  describe  the  important  fea¬ 
tures  of  proto-Cdyn.  In  doing  so,  we  indicate  impor¬ 
tant  issues  for  any  Concurrent  DASSL  driver. 

Template  Structure 

A  template  is  a  prototype  for  a  sequence  of  DAE’s 
which  can  be  used  repeatedly  in  different  instantia¬ 
tions.  Normally,  but  not  always,  the  template  cor¬ 
responds  to  some  subsystem  of  a  physical-model  de¬ 
scription  of  a  system,  like  a  tank  or  distillation  tray. 
The  key  characteristics  of  a  template  are:  the  number 
of  integration  states  it  incorporates  (typically  fixed), 
the number of non-state variables it incorporates (typically
fixed), its input and output connections to other
templates,  and  external  sources  (forcing  functions)  and 
sinks.  State  variables  participate  in  the  overall  DASSL 
integration  process.  Non-states  are  defined  as  vari¬ 
ables  which,  given  the  states  of  a  template  alone,  may 
be  computed  uniquely.  They  are  essentially  local  tear 
variables.  It  is  up  to  the  template  designer  whether  or 
not  to  use  such  local  tear  variables:  They  impact  the 
numerical quality of the solution, in principle. Alternative
formulations, where all variables of a template
are treated as states, can be posed, and comparisons
made. Because of the superlinear growth of linear algebra
complexity, the introduction of extra integration
states must be justified on the basis of numerical
accuracy. Otherwise, they artificially slow down the
problem solution, perhaps significantly. Non-states
are extremely convenient, and practically useful; they
appear in all the dynamic simulators we have come
across.

The  template  state  and  non-state  structure  implies  a 
two-phase  residual  computation.  First,  given  a  state 
Z,  the  non-states  of  each  template  are  updated  on 
a  template-by-template  basis.  Then,  given  its  states 
and  non-states,  inputs  from  other  templates  and  ex¬ 
ternal  inputs,  each  template’s  residuals  may  be  com¬ 
puted.  In  the  sequential  implementation,  this  poses  no 
particular  nuisances,  other  than  two  evaluation  loops 
over  all  templates.  However,  in  concurrent  evaluation, 
a  communication  phase  intervenes  between  non-state 
updates  and  residual  updates.  This  communication 
phase transmits all states and non-states appearing as
outputs  of  templates  to  their  corresponding  inputs  at 
other  templates.  This  transmission  mechanism  is  con¬ 
sidered  further  below  under  concurrent  formulation. 
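
The two-phase evaluation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the proto-Cdyn implementation (which is written in C): the Template class, its field names, and the toy tank-like rules are all hypothetical, and the residuals are purely algebraic for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Template:
    """Hypothetical template instance: states, non-states, and input wiring."""
    states: List[float]
    # Phase-1 rule: non-states computed from this template's own states only.
    nonstate_rule: Callable[[List[float]], List[float]]
    # Phase-2 rule: residuals from states, non-states, and connected inputs.
    residual_rule: Callable[[List[float], List[float], List[float]], List[float]]
    # Each input is (source template index, "state" or "nonstate", position).
    inputs: List[Tuple[int, str, int]] = field(default_factory=list)
    nonstates: List[float] = field(default_factory=list)


def evaluate_residuals(templates: List[Template]) -> List[float]:
    # Phase 1: update the non-states template by template.
    for t in templates:
        t.nonstates = t.nonstate_rule(t.states)
    # In a concurrent code, template outputs would be communicated here.
    # Phase 2: gather inputs from other templates and evaluate residuals.
    residuals: List[float] = []
    for t in templates:
        gathered = []
        for src, kind, pos in t.inputs:
            source = templates[src]
            values = source.states if kind == "state" else source.nonstates
            gathered.append(values[pos])
        residuals.extend(t.residual_rule(t.states, t.nonstates, gathered))
    return residuals


# Toy usage: two tanks in series; the outflow (a non-state) of tank 0 feeds tank 1.
def outflow(states):
    return [0.5 * states[0]]


def balance(states, nonstates, inputs):
    inflow = inputs[0] if inputs else 1.0  # external feed when unconnected
    return [inflow - nonstates[0]]


tanks = [
    Template([2.0], outflow, balance),
    Template([4.0], outflow, balance, inputs=[(0, "nonstate", 0)]),
]
res = evaluate_residuals(tanks)
```

The point of the sketch is the ordering constraint: every non-state must be current before any residual that consumes it is formed, which is why the concurrent version needs a communication phase between the two loops.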

Problem  Preformulation 

In general, the “optimal” ordering for the equations of
a dynamic simulation will be too difficult to
establish¹, because of the NP-hard issues involved in
structure  selection.  However,  many  important  heuris¬ 
tics  can  be  applied,  such  as  those  that  precedence  or¬ 
der  the  nonlinear  equations,  and  those  that  permute 
the  Jacobian  structure  to  a  more  nearly  triangular  or 
banded  form  [8].  For  the  proto-Cdyn  simulator,  we 
skirt  these  issues  entirely,  because  it  proves  easy  to  ar¬ 
range  a  network  of  columns  to  produce  a  “good  struc¬ 
ture”  -  a  main  block  tri-diagonal  Jacobian  structure 
with  off-block-diagonal  structure  for  the  intercolumn 
connections,  simply  by  taking  the  distillation  columns 
with  their  states  in  tray-by-tray,  top-down  (or  bottom- 
up)  order. 

Given a set of DAE’s, and an ordering for the equations
and states (i.e., rows and columns of the Jacobian,
respectively), we need to partition these equations
between the multicomputer nodes, according to
a two-dimensional process grid of shape P x Q = R.
The partitioning of the equations forms, in main part,
the so-called “concurrent database.” This grid structure
is illustrated in [13, Figure 2]. In proto-Cdyn, we
utilize a single process grid for the entire Concurrent
DASSL calculation. That is, we don’t currently exploit
the Concurrent DASSL feature which allows explicit
transformations between the main calculational
phases (see below). In each process column, the entire
set of equations is to be reproduced, so that any
process column can compute not only the entire residual
vector for a prediction calculation, but also any
column of the Jacobian matrix.

¹ Optimality per se hinges on what our objective is. If, for
instance, we want minimum time for LU factorization, still the
objective of minimum fill-in does not guarantee minimum time
in a concurrent setting.

A  mapping  between  the  global  equations  and  local 
equations  must  be  created.  In  the  general  case,  it  will 
be difficult to generate a closed-form expression for either
the global-to-local mapping or its inverse (one that
also requires less than O(N) storage). At most, we will have
on hand a partial (or weak) inverse in each process, so
that  the  corresponding  global  index  of  each  local  index 
will  be  available.  Furthermore,  in  each  node,  a  partial 
global-to-local  list  of  indices  associated  with  the  given 
node  will  be  stored  in  global  sort  order.  Then,  by  bi¬ 
nary  search,  a  weak  global-to-local  mapping  will  be 
possible  in  each  process.  That  is,  each  process  will 
be  able  to  identify  if  a  global  index  resides  within  it, 
and  the  corresponding  local  index.  A  strong  mapping 
for  row  (column)  indices  will  require  communication 
between all the processes in a process row (respectively,
column). In the foregoing, we make the tacit
assumption that it is an unreasonable practice to use
storage  proportional  to  the  entire  problem  size  N  in 
each  node,  except  if  this  unscalability  can  be  removed 
cheaply  when  necessary  for  large  problems. 
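
A weak mapping of this kind can be sketched as follows. The class and method names are hypothetical, and Python's standard bisect module stands in for the binary search; the storage held by each process is proportional to the number of indices it owns, not to N.

```python
from bisect import bisect_left
from typing import List, Optional


class WeakIndexMap:
    """Per-process index map that stores only the indices this process owns."""

    def __init__(self, owned_global: List[int]):
        # local -> global, in local order (the partial, or weak, inverse).
        self.local_to_global = list(owned_global)
        # (global, local) pairs kept in global sort order for binary search.
        pairs = sorted((g, l) for l, g in enumerate(owned_global))
        self._sorted_globals = [g for g, _ in pairs]
        self._locals = [l for _, l in pairs]

    def to_local(self, g: int) -> Optional[int]:
        """Local index of global index g, or None if g does not reside here."""
        k = bisect_left(self._sorted_globals, g)
        if k < len(self._sorted_globals) and self._sorted_globals[k] == g:
            return self._locals[k]
        return None

    def to_global(self, l: int) -> int:
        """Global index corresponding to local index l."""
        return self.local_to_global[l]


# Usage: this process owns four global indices, stored in local order.
m = WeakIndexMap([17, 3, 42, 8])
hit = m.to_local(42)    # local position of global index 42
miss = m.to_local(5)    # global index 5 is owned by some other process
```

A strong mapping (finding which process owns an arbitrary global index) is not available from this structure alone; as the text notes, that requires communication along the process row or column.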

The  proto-Cdyn  simulator  works  with  templates  of 
specific  structure  -  each  template  is  a  form  of  a  dis¬ 
tillation  tray  and  generates  the  same  number  of  inte¬ 
gration  states.  It  therefore  skirts  the  need  for  weak 
distributions.  Consequently,  the  entire  row  mapping 
procedure  can  be  accomplished  using  the  closed-form 
general two-parameter distribution function family
described  in  [13],  where  the  block  size  B  is  chosen  as 
the  number  of  integration  states  per  template.  The 
column  mapping  procedure  is  accomplished  with  the 
one-parameter  distribution  function  family  C  also  de¬ 
scribed  in  [13].  The  effects  of  row  and  column  degree- 
of-scattering  are  described  in  [13]  with  attention  to 
linear  algebra  performance. 
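
For illustration only, the two simplest closed-form layouts can be written down directly. These are generic block-linear and scattered (cyclic) mappings, assumed here as stand-ins; they are not the parameterized distribution families of [13], and the function names are hypothetical.

```python
def block_linear_owner(i, n, nprocs, block):
    """Owner process and local index of global row i in a block-linear layout.

    Rows are grouped into blocks of `block` consecutive rows (one block per
    template), and contiguous runs of whole blocks are assigned to processes.
    """
    nblocks = -(-n // block)            # ceil(n / block)
    per_proc = -(-nblocks // nprocs)    # blocks per process (last may hold fewer)
    b = i // block                      # block containing row i
    p = b // per_proc                   # process holding that block
    local = (b - p * per_proc) * block + i % block
    return p, local


def scatter_owner(j, nprocs):
    """Owner process and local index of global column j in a scattered layout."""
    return j % nprocs, j // nprocs


# Usage with the sizes of the example later in this paper: 9009 equations,
# 9 states per tray, and 8 process rows.
first_row_on_p1 = block_linear_owner(1134, 9009, 8, 9)
tray_owners = {block_linear_owner(i, 9009, 8, 9)[0] for i in range(9, 18)}
```

Choosing the block size equal to the number of states per template keeps every template's equations on one process row, which is what removes the need for a searched (weak) mapping.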

Concurrent  Formulation 

Overview 

Next, we turn to the concurrent numerical solution of Equation 1
(that is, the IVP) via the DASSL algorithm. We
cover the major computational steps in the abstract, and
we also describe the generic aspects of proto-Cdyn in
this connection. In the subsequent section, we discuss
issues peculiar to the distillation simulation.

Broadly,  the  concurrent  solution  of  IVP  consists  of 
three  block  operations:  startup,  dynamic  simulation, 
and  a  cleanup  phase.  Significant  concurrency  is  appar¬ 
ent  only  in  the  dynamic  simulation  phase.  We  will  as¬ 
sume  that  the  simulation  interval  requested  generates 
enough  work  so  that  the  startup  and  cleanup  phases 
prove  insignificant  by  comparison  and  consequently 
pose  no  serious  Amdahl’s-law  bottleneck.  Given  this 
assumption,  we  can  restrict  our  attention  to  a  single 
step of IVP as illustrated schematically in Figure 0.

In  the  startup  phase,  a  sequential  host  program  inter¬ 
prets  the  user  specification  for  the  simulation.  From 
this  it  generates  the  concurrent  database:  the  tem¬ 
plates  and  their  mutual  interconnections,  data  needed 
by  particular  templates,  and  a  distribution  of  this  in¬ 
formation  among  the  processes  that  are  to  participate. 
The  processes  are  themselves  spawned  and  fed  their  re¬ 
spective  databases.  Once  they  receive  their  input  in¬ 
formation,  the  processes  re-build  the  data  structures 
for  interfacing  with  Concurrent  DASSL,  and  for  gener¬ 
ating  the  residuals.  Tolerances,  and  initial  derivatives 
must  be  computed  and/or  estimated.  Furthermore,  in 
each  process  column,  the  processes  must  rendezvous  to 
finalize  their  communication  labeling  for  the  transmis¬ 
sion  of  states  and  non-states  to  be  performed  during 
the  residual  calculation.  This  provides  the  basis  for 
a  reactive,  deadlock-free  update  procedure  described 
below. 

The  cleanup  phase  basically  retrieves  appropriate  state 
values  and  returns  them  to  the  host  for  propagation 
to  the  user.  Cleanup  may  actually  be  interspersed  in¬ 
termittently  with  the  actual  dynamic  simulation.  It 
provides simple bookkeeping of the results of simulation
and terminates the concurrent processes at the
simulation’s  conclusion. 

The  dynamic  simulation  phase  consists  of  repetitive 
prediction and correction steps, and marches in time.
Each  successful  time  step  requires  the  solution  of  one 
or  more  instances  of  Equation  2  -  additional  timesteps 
that  converge  but  fail  to  satisfy  error  tolerances,  or  fail 
to  converge  quickly  enough,  are  necessarily  discarded. 
In  the  next  section,  we  cover  the  aspects  of  these  op¬ 
erations  in  more  detail,  for  a  single  step. 

Single  Integration  Step 

The  Integration  Computations  of  DASSL  are  a 
fixed  leading-coefficient,  variable-stepsize  and  order, 
backward-differentiation-formula (BDF) implicit integration
scheme, described clearly in [4, Chapter 5] and
outlined in [11]. Concurrent DASSL faithfully implements
this numerical method, with no significant differences.
Test problems run with the DASSL Fortran
code  and  the  new  C  code  (on  one  and  multiple  com¬ 
puters)  certify  this  degree  of  compatibility. 

The sequential time complexity of the integration computations
is O(N), if considered separately from the
residual calculation called in turn, which is also normally
O(N) (see below). We pose these operations
on a P x Q = R grid, where we assume that each process
column can compute complete residual vectors.
Each process column repeats the entire prediction operations:
there is no speedup associated with Q > 1,
and we replicate all DASSL BDF and predictor vectors
in each process column. Taller, narrower grids are
likely to provide the overall greatest speedup, though
the residual calculation may saturate (and slow down
again) because of excessive vertical communication requirements.
It is definitely not true that the R x 1
shape is optimal in all cases.

The  distribution  of  coefficients  in  the  rows  has  no  im¬ 
pact  on  the  integration  operations,  and  is  dictated 
largely  by  the  requirements  of  the  residual  calculation 
itself.  In  practical  problems,  the  concurrent  database 
cannot be reproduced in each process (cf. [18]), so a
given  process  will  only  be  able  to  compute  some  of  the 
residuals.  Furthermore,  we  may  not  have  complete 
freedom  in  scattering  these  equations,  because  there 
will  often  be  a  tradeoff  between  the  degree  of  scatter¬ 
ing  and  the  amount  of  communication  needed  to  form 
the  entire  residual  vector. 

The amount of O(N) integration-computation work is
not terribly large; there is consequently a non-trivial
but not tremendous effort involved in the integration
computations. (Residual computations dominate in
many if not most circumstances.) Integration operations
consist mainly of vector-vector operations not
requiring any interprocess communication and, in addition,
fixed startup costs. Operations include prediction
of the solution at the time point, initiation and
control of the Newton iteration that “corrects” the solution,
convergence and error-tolerance checking, and
so forth. For example, the approximation D_j is chosen
within this block using the BDF formulas. For
these operations, each process column currently operates
independently, and repetitively forms the results.
Alternatively, each process column could stride with
step Q, and row-combines could be used to propagate
information across the columns [14]. This alternative
would increase speed for sufficiently large problems,
and can easily be implemented. However, because of
load-imbalance in other stages of the calculation, we
are convinced that including this type of synchronization
could be an overall negative rather than positive
for performance. This alternative will nevertheless be
a future user-selectable option.

Included in these operations are a handful of norm
operations, which constitute the main interprocess
communication required by the integration computations
step; norms are implemented concurrently via
recursive doubling (combine) [17,14]. Actually, the
weighted norm used by DASSL requires two recursive
doubling operations, each combining a scalar: first
to obtain the vector coefficient of maximum absolute
value, then to sum the weighted norm itself. Each can
be  implemented  as  Q  independent  column  combines, 
each  producing  the  same  repetitive  result,  or  a  single 
Q-striding  norm,  that  takes  advantage  of  the  repeti¬ 
tion  of  information,  but  utilizes  two  combines  over  the 
entire  process  grid.  Both  are  supported  in  Concurrent 
DASSL,  although  the  former  is  the  default  norm.  As 
with  the  original  DASSL,  the  norm  function  can  be 
replaced,  if  desired. 
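
The two-combine structure of the norm can be illustrated with a small sequential simulation. The sketch assumes the scaled weighted root-mean-square form (maximum first, then a sum of squares scaled by that maximum), which is our understanding of the standard DASSL norm; the function names are hypothetical, and plain Python reductions stand in for the recursive-doubling combines.

```python
import math


def weighted_rms_norm(v, w):
    """Sequential reference: scaled weighted root-mean-square norm."""
    vmax = max(abs(vi / wi) for vi, wi in zip(v, w))
    if vmax == 0.0:
        return 0.0
    total = sum(((vi / wi) / vmax) ** 2 for vi, wi in zip(v, w))
    return vmax * math.sqrt(total / len(v))


def distributed_norm(parts):
    """parts holds one (v_local, w_local) pair per process in a process column."""
    # Combine 1: each process contributes one scalar, its local maximum.
    local_max = [
        max((abs(vi / wi) for vi, wi in zip(v, w)), default=0.0)
        for v, w in parts
    ]
    vmax = max(local_max)
    if vmax == 0.0:
        return 0.0
    # Combine 2: each process contributes one scalar, its scaled sum of squares.
    local_sum = [
        sum(((vi / wi) / vmax) ** 2 for vi, wi in zip(v, w))
        for v, w in parts
    ]
    n = sum(len(v) for v, _ in parts)  # global length, known to every process
    return vmax * math.sqrt(sum(local_sum) / n)


# Usage: a four-entry vector split across two processes.
v = [3.0, -4.0, 1.0, 2.0]
w = [1.0, 2.0, 1.0, 4.0]
reference = weighted_rms_norm(v, w)
combined = distributed_norm([(v[:2], w[:2]), (v[2:], w[2:])])
```

The second combine cannot start until the first has finished, because every local sum is scaled by the global maximum; this is why two separate combines are needed rather than one.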

Single  Residuals  are  computed  in  prediction,  and 
as  needed  during  correction.  Multiple  residuals  are 
computed when forming the finite-difference Jacobian.
Single  residuals  are  computed  repetitively  in  each  pro¬ 
cess  column,  whereas  the  multiple  residuals  of  a  Jaco¬ 
bian  computation  are  computed  uniquely  in  the  pro¬ 
cess  columns. 

Here,  we  consider  the  single  residual  computation  re¬ 
quired  by  the  integration  computations  just  described. 
Given a state vector Z, and an approximation for its time derivative $\dot{Z}$, we
need to evaluate $F(Z, \dot{Z}, \tau_j) = F_D(Z, \tau_j)$. The exploitable
concurrency available in this step is strictly
a  function  of  the  model  equations.  As  defined,  there 
are  N  equations  in  this  system,  so  we  expect  to  use 
at best N computers for this step. Practically, there
will  be  interprocess  communication  between  the  pro¬ 
cess  rows,  corresponding  to  the  connectivity  among  the 
equations.  This  will  place  an  upper  limit  on  P  <  K 
(the  number  of  row  processes)  that  can  be  used  before 
the  speed  will  again  decrease:  we  can  expect  efficient 
speedup  for  this  step  provided  that  the  cost  of  the 
interprocess  communication  is  insignificant  compared 
to  the  single-equation  grain  size.  As  estimated  in  [14], 
the granularity T_comm/T_calc for the Symult s2010 multicomputer
is about fifty, so this implies about four
hundred  and  fifty  floating  point  operations  per  commu¬ 
nication  in  order  to  achieve  90%  concurrent  efficiency 
in  this  phase. 
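
The figure of four hundred and fifty operations follows from a simple efficiency model, which we assume here for illustration: each communication costs as much as T_comm/T_calc floating point operations, and efficiency is useful work divided by useful work plus communication. The function names and the model itself are assumptions of this sketch.

```python
def concurrent_efficiency(flops_per_message, comm_to_calc_ratio=50.0):
    """Efficiency under the assumed model: work / (work + communication)."""
    return flops_per_message / (flops_per_message + comm_to_calc_ratio)


def flops_needed(target_efficiency, comm_to_calc_ratio=50.0):
    """Invert the model: operations per communication for a target efficiency."""
    return comm_to_calc_ratio * target_efficiency / (1.0 - target_efficiency)


# With a granularity of fifty, 90% efficiency needs about 450 operations
# per communication, and 450 operations per message give 90% efficiency.
required = flops_needed(0.90)
achieved = concurrent_efficiency(450.0)
```

Under the same model, raising the target to 95% roughly doubles the requirement, which indicates how quickly fine-grained residual evaluations become communication-bound.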

Jacobian  Computation  There  is  evidently  much 
more available concurrency in this computational step
than for the single residual and integration operations,
since,  for  finite  differencing,  N  independent  residual 
computations  are  apparently  required,  each  of  which 
is  a  single-state  perturbation  of  Z.  Based  on  our 
overview  of  the  residual  computation,  we  might  naively 
expect  to  use  K  x  N  processes  effectively;  however, 
the  simple  perturbations  can  actually  require  much 
less  model  evaluation  effort  because  of  latency  [8,10], 
which  is  directly  a  function  of  the  sparsity  structure  of 
the model equations, Equation 1. In short, we can attain
the same performance with far fewer than K x N
processors. 

In general, we’d like to consider the Jacobian computation
on a rectangular grid. For this, we can consider
using P x Q = R processes to accomplish the calculation.
With a general grid shape, we exploit some concurrency
in both the column evaluations and in the residual
computations, with $T_{jac,P \times Q=R}$ the time for this
step, $S_{jac,P \times Q=R}$ the corresponding speedup, $T_{res,P}$
the residual evaluation time with P row processes, and
$S_{res,P}$ the apparent speedup compared to one row process:

$$T_{jac,P \times Q=R} \approx \lceil N/Q \rceil \times T_{res,P}$$

$$S_{jac,P \times Q=R} \approx \frac{N}{\lceil N/Q \rceil} \times S_{res,P}$$

assuming no shortcuts are available as a result of latency.
This timing is exemplified in the example below,
which does not take advantage of latency.
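
These two estimates are easy to evaluate numerically. The helper names below are hypothetical; the formulas are the ones just given, with the ceiling computed in integer arithmetic.

```python
def residuals_per_column(n, q):
    """Number of perturbed residuals each process column evaluates: ceil(N/Q)."""
    return -(-n // q)


def jacobian_time(n, q, t_res_p):
    """Estimated Jacobian time on a P x Q grid, ignoring latency shortcuts."""
    return residuals_per_column(n, q) * t_res_p


def jacobian_speedup(n, q, s_res_p):
    """Estimated Jacobian speedup given the residual speedup with P row processes."""
    return n / residuals_per_column(n, q) * s_res_p


# For N = 9009 and Q = 4, each process column evaluates 2253 residuals,
# so the columns alone contribute a factor of just under four.
per_column = residuals_per_column(9009, 4)
column_factor = jacobian_speedup(9009, 4, 1.0)
```

The ceiling matters only when Q does not divide N; here it costs about 0.03% relative to the ideal factor of four.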

There  is  additional  work  whenever  the  Jacobian 
structure  is  rebuilt  for  better  numerical  stability  in 
the  subsequent  LU  factorization  (A-mode).  Then, 
O(N^2/PQ) work is involved in each process in the filling
of the initial Jacobian. In the normal case, work
proportional  to  the  number  of  local  non-zeroes  plus 
fill  elements  is  incurred  in  each  process  for  re-filling 
the  sparse  Jacobian  structure. 

Exploitation of Latency has been considered in
the Concurrent DASSL framework. We currently
have experimental versions of two mechanisms, both
of which are designed to work with the sparse-matrix
structures associated with direct, sparse LU factorization
(see [13]). The first is called “bandlike” Jacobian
evaluation. For a banded Jacobian matrix of bandwidth
b, only b residuals are needed to evaluate the
Jacobian. This feature is incorporated into the original
DASSL, along with a LINPACK banded solver. In
Concurrent DASSL, collections of Jacobian columns
are placed in each process column, according to the column
data distribution, which thus far is picked solely
to balance LU factorization and triangular-solve performance
[13]. In each process column, there will be
“compatible” columns that can be evaluated using a
single, composite perturbation. Identification of these
compatible columns is accomplished by checks on the
bandwidth overlap condition. Columns that possess
off-band structure are stricken from the list and evaluated
separately. Presumably, a heuristic algorithm
could be employed to increase further the size of the
compatible sets, but this is yet to be implemented.
The same “greedy” algorithm of Curtis et al.
used for the sequential reduction of Jacobian computation
effort would be applied independently to each
process column (see comments by [8, Section 12.3]).
Then, clearly, the column distribution affects the performance
of the Jacobian computation, and the linear-algebra
performance can no longer be viewed so readily
in isolation.
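
The idea of compatible columns is easiest to see in the purely banded, sequential case. The sketch below is an illustration under stated assumptions, not the Concurrent DASSL code: it assumes a symmetric band with half-bandwidth half_bw (so b = 2*half_bw + 1), uses forward differences with a fixed step h, and the function names are hypothetical. Columns whose indices differ by a multiple of b touch disjoint rows, so they share one composite perturbation.

```python
def banded_fd_jacobian(f, x, half_bw, h=1e-6):
    """Finite-difference Jacobian of f at x, assuming J[i][j] == 0 whenever
    |i - j| > half_bw. Needs at most 2*half_bw + 1 perturbed residuals."""
    n = len(x)
    width = 2 * half_bw + 1
    f0 = f(x)
    jac = [[0.0] * n for _ in range(n)]
    for group in range(min(width, n)):
        cols = range(group, n, width)      # mutually compatible columns
        xp = list(x)
        for j in cols:
            xp[j] += h                     # one composite perturbation
        fp = f(xp)
        for j in cols:
            lo, hi = max(0, j - half_bw), min(n, j + half_bw + 1)
            for i in range(lo, hi):        # rows influenced by column j only
                jac[i][j] = (fp[i] - f0[i]) / h
    return jac


def tridiagonal_residual(x):
    """Toy residual with tridiagonal coupling: x[i-1] - 2*x[i]**2 + x[i+1]."""
    n = len(x)
    out = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        out.append(left - 2.0 * x[i] ** 2 + right)
    return out


# Usage: a 10-equation tridiagonal system needs 3 perturbed residuals, not 10.
xs = [0.1 * (k + 1) for k in range(10)]
J = banded_fd_jacobian(tridiagonal_residual, xs, half_bw=1)
```

In the concurrent setting described above, the same grouping would be applied only to the columns resident in a given process column, which is why the column distribution influences how many composite perturbations are needed.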

We  have  also  devised  a  “blocklike”  format,  which  will 
be  applied  to  block  n-diagonal  matrices  that  include 
some  off-block  entries  as  well.  Optimally,  fewer  resid¬ 
ual  computations  will  be  needed  than  for  the  banded 
case.  The  same  column-by-column  compatible  sets  will 
be created, and the Curtis algorithm can also be applied.
Hopefully, because of the less restrictive compatibility
requirement, the “blocklike” case will produce
higher concurrent speedups than that attained
using the conservative bandlike assumption for Jacobians
possessing blocklike structure. Comparative results
will be presented in a future paper.

The LU Factorization Following the philosophy
of Harwell’s MA28, we have interfaced a new concurrent
sparse solver to Concurrent DASSL, the details
of which are given elsewhere in these proceedings
[13]. In short, there is a two-step factorization procedure:
A-mode, which chooses stable pivots according
to a user-specified function, and builds the sparse data
structures dynamically; and B-mode, which re-uses the
data structures and pivot sequence on a similar matrix,
but monitors stability with a growth-factor test.
A-mode is repeated whenever necessary to avoid instability.
We expect sub-cubic time complexity and
sub-quadratic space complexity in N for the sparse
solver. We attain acceptable factorization speedups
for systems that are not narrow banded, and of sufficient
size. We intend to incorporate multiple-pivoting
heuristic strategies, following [1], to improve further the
performance of future versions of the solver. This may
also contribute to better performance of the triangular
solves.

Forward- and Back-solving Steps take the factored
form

$$P_R A P_C^T = L U,$$

with L unit lower-triangular, U upper-triangular, and
permutation matrices $P_R$, $P_C$, and solve $Ax = b$, using
the implicit pivoting approach described in [13].
Sequentially, the triangular solves each require work
proportional to the number of entries in the respective
triangular factor, including fill-in. We have yet to
find an example of sufficient size for which we actually
attain speedup for these operations, at least for the
sparse case. At most, we try to prevent these operations
from becoming competitive in cost with the B-mode
factorization; we detail these efforts in [13]. In brief,
the optimum grid shape for the triangular solves has
Q = 1, and P somewhat smaller than what we can
use in all the other steps. As stated, small P seems
better thus far, though for many examples, the increasing
overhead as a function of increasing P is not
unacceptable (see [13] and the example below).

Residual  Communication  is  an  important  aspect 
of  the  proto-Cdyn  layer.  As  indicated  in  the  startup- 
phase  discussion,  the  members  of  a  process  column 
initially  share  information  about  the  groups  of  states 
and  non-states  they  will  exchange  during  a  residual 
computation.  For  residual  communication,  a  reactive 
transmission  mechanism  is  employed,  to  avoid  dead¬ 
locks.  Each  process  transmits  its  next  group  of  states 
to  the  appropriate  process  and  then  looks  for  any  re¬ 
ceipt  of  state  information.  Along  with  the  state  val¬ 
ues  are  indices  that  directly  drive  the  destinations  for 
these  values.  This  index  information  is  shared  during 
the  startup  phase  and  allows  the  messages  to  drive  the 
operation.  Through  non-blocking  receives,  this  proce¬ 
dure  avoids  problems  of  transmission  ordering.  Re¬ 
gardless  of  the  template  structure,  at  most  one  send 
and  receive  is  needed  between  any  pair  of  column  pro¬ 
cesses. 
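
The send-then-poll pattern can be mimicked with a single-threaded simulation. This is only a sketch of the control flow under stated assumptions: the function name, the message layout, and the use of in-memory queues in place of real message passing are all hypothetical, and the real code runs one process per node with genuine non-blocking receives.

```python
from collections import deque


def reactive_exchange(outgoing, expected):
    """Simulate a reactive exchange among the processes of one process column.

    outgoing[p]: list of (dest, [(dest_slot, value), ...]) messages p must send.
    expected[p]: number of messages p must receive.
    Returns, for each process, a dict mapping destination slot to value.
    """
    nproc = len(outgoing)
    inbox = [deque() for _ in range(nproc)]
    pending = [deque(msgs) for msgs in outgoing]
    received = [0] * nproc
    store = [dict() for _ in range(nproc)]
    while any(pending) or any(received[p] < expected[p] for p in range(nproc)):
        progressed = False
        for p in range(nproc):
            if pending[p]:                    # send the next group, if any
                dest, payload = pending[p].popleft()
                inbox[dest].append(payload)
                progressed = True
            while inbox[p]:                   # non-blocking poll for arrivals
                for slot, value in inbox[p].popleft():
                    store[p][slot] = value    # indices travel with the values
                received[p] += 1
                progressed = True
        if not progressed:
            raise RuntimeError("expected messages never arrived")
    return store


# Usage: three processes; process 1 sends to both of its neighbours.
outgoing = [
    [(1, [(0, 1.5)])],
    [(0, [(2, -3.0)]), (2, [(1, 7.0)])],
    [],
]
result = reactive_exchange(outgoing, expected=[1, 1, 1])
```

Because each message carries its own destination indices and no process ever waits on a particular sender, the order in which messages arrive does not matter, which is the property that avoids deadlock.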

Chemical  Engineering  Example 

The algorithms and formalism needed to run this example
amount to about 70,000 lines of C code, including
the simulation layer, Concurrent DASSL, the linear
algebra packages, and support functions [14,13,12].

In this simulation, we consider seven distillation
columns arranged in a tree-sequence [12], working
on the distillation of eight alcohols: methanol,
ethanol, propan-1-ol, propan-2-ol, butan-1-ol,
2-methylpropan-1-ol, butan-2-ol, and 2-methylpropan-2-ol.
Each column has 143 trays. Each tray is initialized
to a non-steady condition, and the system is
relaxed to the steady state governed by a single feed
stream to the first column in the sequence. This setup
generates suitable dynamic activity for illustrating the
cost of a single “transient” integration step.

We  note  the  performance  in  Table  0.  Because  we 
have  not  exploited  latency  in  the  Jacobian  computa¬ 
tion,  this  calculation  is  quite  expensive,  as  seen  for 
the  sequential  times  on  a  Sun  3/260  depicted  there. 
(The Sun 3/260, which was lightly loaded during this
test run, is quite comparable in speed to a single
Symult s2010 node.) As expected, Jacobian calculations
speed up efficiently, and we are able to get approximately
a speedup of 100 for this step using 128
nodes.  The  A-mode  linear  algebra  also  speeds  up  sig¬ 
nificantly.  The  B-mode  factorization  speeds  up  negli¬ 
gibly  and  quickly  slows  down  again  for  more  than  16 
nodes.  Likewise,  the  triangular  solves  are  significantly 
slower  than  the  sequential  time.  It  should  be  noted 
that  B-mode  reflects  two  orders  of  magnitude  speed 
improvement  over  A-mode.  This  reflects  the  fact  that 
we  are  seeing  almost  linear  time  complexity  in  B-mode, 
since  this  example  has  a  narrow  block  tri-diagonal  Ja¬ 
cobian  with  too  little  off-diagonal  coupling  to  gener¬ 
ate  much  fill-in.  It  seems  hard  to  imagine  speeding 
up  B-mode  for  such  an  example,  unless  we  can  exploit 
multiple  pivots.  We  expect  multiple-pivot  heuristics 
to  do  reasonably  well  for  this  case,  because  of  its  nar¬ 
row  structure,  and  nearly  block  tri-diagonal  structure. 
We  have  used  Wilson  Equation  Vapor-Liquid  Equilib¬ 
rium  with  the  Antoine  Vapor  equation.  We  have  found 
that  the  thermodynamic  calculations  were  much  less 
demanding than we expected, with bubble-point computations
requiring  iterations to converge. Consequently,
there was not the greater weight of Jacobian
calculations  we  expected  beforehand.  Our  model  as¬ 
sumes  constant  pressure,  and  no  enthalpy  balances. 
We  include  no  flow  dynamics  and  include  liquid  and 
vapor  flows  as  states,  because  of  the  possibility  of  feed¬ 
backs. 
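
The speedups quoted above can be recomputed directly from Table 0. The short script below transcribes the Jacobian, A-mode and B-mode columns; the variable and function names are hypothetical, and the 32x1 Jacobian entry is read as 1829.93, an assumption about a decimal point that is missing in our copy of the table.

```python
# Times in seconds from Table 0: (Jacobian, A-mode, B-mode) per grid shape.
TABLE0 = {
    "1x1":   (64672.2, 5089.96, 61.82),
    "8x1":   (6870.82, 1024.41, 47.827),
    "16x1":  (3505.13, 547.625, 52.402),
    "32x1":  (1829.93, 316.544, 56.713),   # decimal point assumed
    "64x1":  (1060.40, 219.148, 77.302),
    "32x4":  (491.526, 181.082, 71.482),
    "64x2":  (520.029, 161.052, 82.696),
    "128x1": (608.946, 170.022, 90.905),
}


def speedups(grid):
    """Speedup of each phase relative to the 1x1 (sequential) times."""
    base = TABLE0["1x1"]
    return tuple(b / t for b, t in zip(base, TABLE0[grid]))


jac_128, amode_128, bmode_128 = speedups("128x1")
jac_32x4 = speedups("32x4")[0]
```

On these figures the Jacobian speedup on the 128 x 1 grid is a little over 100, the 32 x 4 grid does better still for that phase, and B-mode on 128 nodes is slower than the sequential run.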

Were  we  to  utilize  latency  in  the  Jacobian  calcula¬ 
tion,  we  could  reduce  the  sequential  time  by  a  fac¬ 
tor  of  about  100.  This  improvement  would  also  carry 
through to the concurrent times for Jacobian computation.
With that reduction, the sequential costs of Jacobian computation
and B-mode factorization would stand in a ratio of about 10:1. As is,
we  achieve  legitimate  speedups  of  about  five.  We  ex¬ 
pect  to  improve  these  results  using  the  ideas  quoted 
elsewhere  here  and  in  [13]. 

From  a  modeling  point-of-view,  two  things  are  im¬ 
portant  to  note.  First,  the  introduction  of  more 
non-ideal  thermodynamics  would  improve  speedup, 
because  these  calculations  fall  within  the  Jacobian 
computation phase and Single-Residual Computation.
Furthermore,  the  introduction  of  a  more  realistic 
model will likewise bear on concurrency, and likely improve
it. For example, introducing flow dynamics, enthalpy
balances and vapor holdups makes the model
more  difficult  to  solve  numerically  (higher  index).  It 
also  increases  the  chance  for  a  wide  range  of  step-sizes, 
and  the  possible  need  for  additional  A-mode  factoriza¬ 
tions  to  maintain  stability  in  the  integration  process. 
Such  operations  are  more  costly,  but  also  have  a  higher 
speedup.  Furthermore,  the  more  complex  models  will 
be  less  likely  to  have  near  diagonal  dominance;  con¬ 
sequently  more  pivoting  is  to  be  expected,  again  in¬ 
creasing  the  chance  for  overall  speedup  compared  to 
the  sequential  case.  Mainly,  we  plan  to  consider  the 
Waveform-Relaxation approach more heavily, and also
to  consider  new  classes  of  dynamic  distillation  simula¬ 
tions  with  Concurrent  DASSL  [12]. 

Conclusions 

We  have  developed  a  high-quality  concurrent  code, 
Concurrent DASSL, for the solution of ordinary
differential-algebraic equations of low index. This
code,  together  with  appropriate  linear  algebra  and 
simulation  layers,  allows  us  to  explore  the  achievable 
concurrent  performance  of  non-trivial  problems.  In 
chemical  engineering,  we  have  applied  it  thus  far  to 
a  reasonably  large,  simple  model  of  coupled  distilla¬ 
tion  columns.  We  are  able  to  solve  this  large  problem, 
which  is  quite  demanding  on  even  a  large  mainframe 
because  of  huge  memory  requirements  and  non-trivial 
computational  requirements;  the  speedups  achieved 
thus  far  are  legitimately  at  least  five,  when  compared 
to  an  efficient  sequential  implementation.  This  illus¬ 
trates  the  need  for  improvements  to  the  linear  algebra 
code,  which  are  feasible  because  sparse  matrices  will 
admit  multiple  pivots  heuristically.  It  also  illustrates 
the  need  to  consider  hidden  sources  of  additional  time- 
like  concurrency  in  Concurrent  DASSL,  perhaps  allow¬ 
ing  multiple  right-hand  sides  to  be  attacked  simultane¬ 
ously  by  the  linear  algebra  codes,  and  amortizing  their 
cost  more  efficiently.  Furthermore,  the  performance 
points  up  the  need  for  detailed  research  into  the  novel 
numerical  techniques,  such  as  Waveform  Relaxation, 
which we have begun to do as well [15].

Table 0. Order 9009 Dynamic Simulation Data (time in seconds)

Grid Shape    Jacobian    A-mode     B-mode    Back-Solve    Solve
1x1           64672.2     5089.96    61.82     2.5           4.7
8x1           6870.82     1024.41    47.827    15.619        30.825
16x1          3505.13     547.625    52.402    19.937        39.491
32x1          1829.93     316.544    56.713    24.383        47.692
64x1          1060.40     219.148    77.302    39.942        59.553
32x4          491.526     181.082    71.482    57.049        101.994
64x2          520.029     161.052    82.696    46.013        86.935
128x1         608.946     170.022    90.905    37.498        67.982

Key single-step calculation times, with the 1x1 case run on an unloaded Sun 3/260 (similar performance-wise to a single
Symult s2010 node) for comparison. The Jacobian rows were distributed in block-linear form, with B = 9, reflecting the
distillation-tray structure. The Jacobian columns were scattered. This is a seven-column simulation of eight alcohols,
with a total of 1,001 trays. See [13] for more on data distributions.

Figure 0. Major computational blocks of a Single Integration Step.


A single step in the integration begins with a number of
BDF-related computations, including the solution "prediction"
step. Then, "correction" is achieved through Newton
iteration steps, each involving a Jacobian computation
and a linear-system solution (LU factorization plus forward-
and back-solves). The computation of the Jacobian in turn
relies upon multiple independent residual calculations, as
shown. The three items enclosed in the dashed oval (Jacobian
computation, through at most N residual computations,
and LU factorization) are, in practice, computed
less often than the others: the old Jacobian matrix is used
in the iteration loop until convergence slows intolerably.




CONVERGENCE AND CIRCUIT PARTITIONING ASPECTS FOR WAVEFORM RELAXATION


Ulla Miekkala  Olavi Nevanlinna

Helsinki University of Technology
Institute of Mathematics
02150 Espoo, Finland


Albert  Ruehli 

IBM  T.  J.  Watson  Research  Center 
Yorktown  Heights 
NY  10598,  U.S.A 


ABSTRACT 

This paper gives a mathematical investigation of the convergence
properties of a model problem which, at first
sight, seems to be unsuitable for waveform relaxation.
The model circuit represents a limiting case for capacitive
coupling where the capacitances to ground are zero.
We show that the WR approach converges. Since the
convergence is generally slow we discuss appropriate
techniques for accelerating convergence.


Figure  1. 


1.  INTRODUCTION 


A substantial speed-up can be achieved in the analysis
of large circuits by using the waveform relaxation (WR)
method [4] rather than the conventional direct, incremental
time (IT) methods. Another speed improvement can
be obtained by applying the WR method to parallel processors
since the approach is based on partitioning the
circuit into small subcircuits which are assigned to the
different processors. Hence, a central problem in the approach
is the partitioning into the "best" possible subcircuits.
Our experience has been that the gain of the WR
over the IT method is a function of the partitioning.

In this short paper, we consider convergence for linear
model circuits. For any RC-circuit, independently of
whether the index is one or two, geometric convergence
can be proved if the partitioning can be performed across
the resistors only [3]. However, this restriction will in
many cases lead to large subcircuits. This is especially
the case for MOSFET transistor circuits where the gate
to drain capacitance plays a key role in the partitioning.
Partitioning must be performed across capacitances in
this case since partitioning across resistors only would
lead to unduly large subcircuits. It is essential in some
cases that the partitioning is performed even for situations
where the convergence cannot be achieved in only
a few iterations. The connections among the subcircuits
may simply lead to very large subcircuits. It is in most
cases desirable to partition into subcircuits of a similar
size. Non-uniform scheduling schemes, like the c-scheduling,
can take advantage of partitioning schemes with
widely varying convergence rates at the subcircuit level
[1].


Here we are investigating a very interesting model circuit
shown in Figure 1. The iteration becomes


(1.1) 


It is a limiting case, where a coupling capacitor is present,
while the capacitors to ground are missing. This example
represents a worst case situation which is somewhat
nonphysical since each node has a capacitance to ground in
a VLSI environment. Earlier convergence proofs show
that the WR approach converges, if we partition this circuit
into two subcircuits across the capacitor, provided
that a capacitor to ground is present [4], [10].

Intuitively, one assumes that the absence of grounded
capacitors will prevent convergence since the nodes are
coupled together when $\xi \to \infty$. Here, we show that the
WR solution for the circuit in Fig. 1 will indeed converge
but that the speed of convergence is quite slow. This is
included in Theorem 1 in Section 2. The result shows
that the speed of convergence depends on the number of
correct derivatives at t = 0. In Section 3 we discuss
the discretized iteration. One can show that for a fixed
time step h the convergence is geometric, i.e. of the form
$\rho^n$ with $\rho$ of the form $1/(1 + ch)$. For the second order
BDF-formula this is obtained if the coupling derivatives
are read through a "filter". In Section 4 we briefly mention
some possibilities for accelerating the convergence.

2.  CONVERGENCE 


The model problem (1.1), without the iteration index n, is
an index one differential-algebraic equation (DAE). This
can be seen by adding the two equations; we get an algebraic
equation for the voltages. This means in particular
that only one of the initial values can be freely chosen.

The iteration in (1.1) is of the form

$$\dot x_1^n + \lambda_1 x_1^n = \dot x_2^{n-1} + f_1, \qquad
\dot x_2^n + \lambda_2 x_2^n = \dot x_1^{n-1} + f_2. \tag{2.1}$$

This is now an ODE system for $(x_1^n, x_2^n)$, but unless the
initial values satisfy the extra restriction we cannot expect
convergence.

If we denote by $y^n$ the iteration error $y^n = x - x^n$, then
we may assume $y^n(0) = 0$ for all n. Thus

$$\dot y_1^n + \lambda_1 y_1^n = \dot y_2^{n-1}, \qquad
\dot y_2^n + \lambda_2 y_2^n = \dot y_1^{n-1}. \tag{2.2}$$

Taking the Laplace transform of (2.2) yields
($\hat y(z) = \int_0^\infty e^{-zt} y(t)\,dt$)

$$\hat y^n(z) = K(z)\,\hat y^{n-1}(z), \tag{2.3}$$

where

$$K(z) = \begin{pmatrix} 0 & \dfrac{z}{z+\lambda_1} \\[2mm] \dfrac{z}{z+\lambda_2} & 0 \end{pmatrix}.$$

According to the basic approach [5], [6] one then looks
at the spectral radius of K(z) along $z = i\xi$:

$$\rho(K(i\xi)) = \left[\frac{\xi^2}{\lambda_1^2+\xi^2}\cdot\frac{\xi^2}{\lambda_2^2+\xi^2}\right]^{1/4}. \tag{2.4}$$

In particular $\rho(K(i\xi)) < 1$ for $|\xi| < \infty$ but near infinity
$\lim_{|\xi|\to\infty} \rho(K(i\xi)) = 1$. This means that the process
does not converge geometrically, say, in $L_2$. However, as
$\rho(K(i\xi)) < 1$ for all $|\xi| < \infty$ there is still convergence,
slower and such that it depends on the smoothness of the
initial error. As large values of $|\xi|$ correspond to fast time
scales we may expect the convergence speed to depend
on the smoothness of the initial error: the smoother the
initial error the faster the convergence. We shall make
this precise.

Consider the following iteration

$$R\,\dot y^n + M y^n = S\,\dot y^{n-1} + N y^{n-1} \tag{2.5}$$

with given $y^0$ and $y^n(0) = 0$ for all n. Here R, M, S
and N are square matrices such that

$$zR + M \ \text{is nonsingular for}\ \operatorname{Re} z > 0. \tag{2.6}$$

Then the symbol of the iteration (2.5)

$$K(z) = (zR + M)^{-1}(zS + N)$$

is an analytic matrix-valued function in $\operatorname{Re} z > 0$. We
make the following model assumption.

There exist nonnegative constants C, a, b with a < 1 such that
for all n

(2.7)

where $\|\cdot\|$ denotes the matrix norm induced by the usual
Euclidean length. In the example we have N = 0 and

(2.7) holds with a = 0.
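The behaviour of the symbol for the two-node example can be checked numerically. The sketch below assumes the error recursion (2.2), for which the Laplace-domain iteration matrix has off-diagonal entries z/(z + λ1) and z/(z + λ2); the function names and the sample values of λ1, λ2 are illustrative choices, not taken from the paper.

```python
import cmath

def symbol_spectral_radius(xi, lam1=1.0, lam2=1.0):
    """Spectral radius of K(i*xi) for the two-node example (assumed form)."""
    z = 1j * xi
    k12 = z / (z + lam1)   # coupling from node 2 into node 1
    k21 = z / (z + lam2)   # coupling from node 1 into node 2
    # The eigenvalues of [[0, k12], [k21, 0]] are +/- sqrt(k12 * k21).
    return abs(cmath.sqrt(k12 * k21))

def closed_form(xi, lam1=1.0, lam2=1.0):
    """Closed-form modulus of the same eigenvalue."""
    return ((xi**2 / (lam1**2 + xi**2)) * (xi**2 / (lam2**2 + xi**2))) ** 0.25

if __name__ == "__main__":
    for xi in (0.1, 1.0, 10.0, 1e3, 1e6):
        print(xi, symbol_spectral_radius(xi, 1.0, 3.0))
```

The printed values stay below 1 for every finite frequency but creep towards 1 as the frequency grows, which is why no uniform geometric rate is available.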

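Next we need the Sobolev norms:

$$u \in H^\alpha \quad\text{iff}\quad
\|u\|_\alpha := \left[\int_{-\infty}^{\infty} (1 + \xi^2)^\alpha\, |\hat u(i\xi)|^2\, d\xi\right]^{1/2} < \infty. \tag{2.8}$$

Observe that $H^0 = L_2$.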

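From (2.5) and the definition of the symbol K(z) we
see that

$$\hat y^n(z) = K(z)^n\, \hat y^0(z)$$

and by Parseval's identity

$$\|y^n\|_0 = \left[\int_{-\infty}^{\infty} |K(i\xi)^n\, \hat y^0(i\xi)|^2\, d\xi\right]^{1/2}. \tag{2.9}$$

Now we make the smoothness assumption:

$$y^0 \in H^\alpha \ \text{for some}\ \alpha > 0. \tag{2.10}$$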


Using  ( 2 .6)  and  ( 2 .7)  in  ( 2 .8)  we  have 

I|y“ll5 

<  ^/  +  er%^+ er\\y\iofdi 

where 


On the other hand, think of $y^0$ as given on the whole of R
and vanishing identically for t < 0. Then it is clear that
the continuity at the origin shows up in the Sobolev exponent
provided that the initial error is otherwise smooth. Recall
that if $y^0 \in H^\alpha$, then by the Sobolev embedding lemma
$y^0$ has continuous derivatives up to order $[\alpha - 1/2]$. Therefore,
if (2.12) holds but $D^l y^0(0) \ne 0$ then $[\alpha - 1/2] \le l - 1$,
and necessarily $\alpha < l + 1/2$.


:=  supC^ 
( 


A simple calculation yields

$$\varphi(n) = O\big((1/n)^{\alpha/2}\big).$$

Theorem 1 Under the model assumptions (2.6), (2.7)
and under the smoothness assumption (2.10) we have

$$\|y^n\|_0 = O\big((1/n)^{\alpha/2}\big)\,\|y^0\|_\alpha. \tag{2.11}$$

■
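The algebraic rate in Theorem 1 can be illustrated on the two-node example. The sketch below assumes λ1 = λ2 = 1, so that the spectral radius along the imaginary axis is (ξ²/(1+ξ²))^{1/2}, and evaluates the supremum of ρ(iξ)^n (1+ξ²)^{-α/2} after the substitution s = 1/(1+ξ²). The function names, the grid size and the choice α = 1.5 are illustrative assumptions.

```python
import math

def phi_grid(n, alpha, samples=20000):
    """Grid estimate of sup over s in (0, 1] of (1 - s)^(n/2) * s^(alpha/2)."""
    best = 0.0
    for k in range(1, samples + 1):
        s = k / samples
        best = max(best, (1.0 - s) ** (n / 2) * s ** (alpha / 2))
    return best

def phi_exact(n, alpha):
    """Same supremum, using the maximiser s = alpha / (n + alpha)."""
    s = alpha / (n + alpha)
    return (1.0 - s) ** (n / 2) * s ** (alpha / 2)

if __name__ == "__main__":
    alpha = 1.5
    for n in (10, 100, 1000):
        # Scaled by n^(alpha/2), the values settle near (alpha/e)^(alpha/2).
        print(n, phi_exact(n, alpha) * n ** (alpha / 2))
```

Under these assumptions the scaled values approach the constant (α/e)^{α/2}, consistent with an O(n^{-α/2}) decay rather than a geometric one.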

The actual exponent $\alpha$ in (2.10) depends strongly on the
preparation of the initial guess at t = 0. If only $y^0(0) = 0$
is assumed, then by partial integration we have

$$\hat y^0(z) = \frac{1}{z}\int_0^\infty e^{-zt}\,\dot y^0(t)\,dt.$$

If we set $z = a + i\xi$, then this implies as a > 0

Although Theorem 1 is quite sharp as such, the smoothness
assumption in the form of the Sobolev norm $\|y^0\|_\alpha$ does
not contain information on where $\hat y^0$ is spectrally large.
As typically we would expect (2.12) only to hold with
l = 1, Theorem 1 only says that eventually the convergence
is likely to be of the form $1/n^r$ with $r \approx 3/4$. To capture
the decay in the early sweeps, we can look again at
the example (2.1). For simplicity, let $\lambda_1 = \lambda_2 = 1$,
X2=f2  =  0  and  x\  =  f\.  Then,  for  the  iteration  error, 
we  have 

y?(2)  =  -Vliz)  =  j7i(z)- 

If  now,  e.g.  /i(t)  =  C(  1  -  e"'’*),  then 


y?(z)  =  -C 


{2z+  Ufz  +  Tf) 


Notice  that  |yi  (z)|  ~  C  for  small  |z|  and  for  large  |z| 


)i/?(^))  ~ 

We  model  the  general  case  in  the  same  way 


ly»(iOI<inin{l,|^}.  (2.14) 


1-0/  M  ^  Const 

^  w 

If $\hat y^0(z)$ is analytic at infinity then

$$|\hat y^0(z)| = O\big(1/|z|^2\big), \qquad |z| \to \infty.$$

If we, however, prepare the initial guess so that

$$D^j y^0(0) = 0 \quad\text{for}\ j = 0, \ldots, l-1, \tag{2.12}$$

then, performing l partial integrations and assuming as
before that $\hat y^0(z)$ is analytic at infinity, we have

$$|\hat y^0(z)| = O\big(1/|z|^{l+1}\big). \tag{2.13}$$

Thus, integrating this along $z = i\xi$ (or along $z = a + i\xi$
with a moderate a > 0 if needed) we obtain


and again consider the iteration (2.5). Suppose that we
would like to stop the iteration when $\|y^n\|_0 \le \varepsilon$. Of
course, we expect to see that n depends on $1/\varepsilon$ and that
for small $\gamma$ we have rapid convergence. We present a
simple-minded approach which shows this qualitatively,
while for each special case the computation should be
carried out in more detail. In estimating $\|y^n\|_0$ we break
the integral into two parts:


For  Ii  we  require  I\  <  ^  f  ~  which  is  the 

case  if  r  :=  f  ^  j  .Now  consider  all  other  constants 
to  be  fixed  (i.e.  C,  a,  b)  and  think  c  and  7  as  variables. 
We  want  h  hence  we  approximate 


y^  €  H*  for  all  a  <  1+  j. 


c 


2 
Here we have a geometric rate and accuracy level $\varepsilon$ is
reached with $n \sim \log(1/\varepsilon)$ sweeps (assuming the initial
accuracy level 1) provided that $\gamma/\varepsilon \le \text{const}$. We summarize:

Theorem 2 Assume the model assumptions (2.6), (2.7)
and initial error in the form (2.14). There exist constants
$c_1$ and $c_2$ only depending on C, a, b in (2.7) such
that for all small $\varepsilon$ and $\gamma$ with $\gamma/\varepsilon \le c_1$ the accuracy
requirement $\|y^n\|_0 \le \varepsilon$ is reached with $n \sim c_2 \log(1/\varepsilon)$
sweeps.

■


3. DISCRETIZATION


We consider the time discretization of (2.1) with constant
time step h and a multistep (k-step) method defined
through its generating polynomials $a(\zeta) = \sum_{j=0}^{k} a_j \zeta^j$,
$b(\zeta) = \sum_{j=0}^{k} b_j \zeta^j$. There are several possibilities to
discretize the right hand side of the equations. Since it contains
derivative terms one could simply use the method
defined by $a(\zeta)$. On the other hand, it can be treated as a
source term since it is known from the previous iteration. As
we will see, discretization using a multiple of the time step h,
and thus having a "filtering" effect, is particularly interesting
in our model problem. We denote the discretization
of the RHS here defined by $\alpha(\zeta) = \sum_{j=-l_0}^{k} \alpha_j \zeta^j$, where $l_0$
may be positive so that $\alpha$ uses more steps than k. Equations
for the iteration error become

$$\sum_{j=0}^{k} a_j\, y_{1,\nu+j}^{n} + h\lambda_1 \sum_{j=0}^{k} b_j\, y_{1,\nu+j}^{n} = \sum_{j=-l_0}^{k} \alpha_j\, y_{2,\nu+j}^{n-1},$$
$$\sum_{j=0}^{k} a_j\, y_{2,\nu+j}^{n} + h\lambda_2 \sum_{j=0}^{k} b_j\, y_{2,\nu+j}^{n} = \sum_{j=-l_0}^{k} \alpha_j\, y_{1,\nu+j}^{n-1}, \tag{3.1}$$
$$\nu = 0, 1, \ldots, \qquad y_{1,0}^{n} = y_{2,0}^{n} = 0.$$

In $l_2$-space (3.1) can be analyzed by using the $\zeta$-transformation
(discrete Laplace transformation) because of
Parseval's identity. With a similar derivation as in [7] we
obtain

$$(a(\zeta) + h\,b(\zeta)\lambda_1)\,\hat y_1^{n}(\zeta) = \alpha(\zeta)\,\hat y_2^{n-1}(\zeta),$$
$$(a(\zeta) + h\,b(\zeta)\lambda_2)\,\hat y_2^{n}(\zeta) = \alpha(\zeta)\,\hat y_1^{n-1}(\zeta), \tag{3.2}$$

where $\hat y(\zeta) = \sum_{j=0}^{\infty} y_j \zeta^{-j}$ for the sequence $\{y_j\}_0^\infty$. As
the iteration (3.2) involves only the factors
$\dfrac{\alpha}{a + hb\lambda_1}$ and $\dfrac{\alpha}{a + hb\lambda_2}$,
it is clearly sufficient to study the simpler (although nonphysical)
model iteration

$$\dot y^n + y^n = \dot y^{n-1}. \tag{3.3}$$

After discretization and $\zeta$-transformation we namely get

$$(a(\zeta) + h\,b(\zeta))\,\hat y^{n} = \alpha(\zeta)\,\hat y^{n-1}, \tag{3.4}$$

whose transfer function

$$K_h(\zeta) = \frac{\alpha(\zeta)}{a(\zeta) + h\,b(\zeta)} \tag{3.5}$$

gives the essential information about the rate of convergence
for (3.1), too.

As shown in [7] the convergence rate is given by

$$\sup_{|\zeta| \ge 1} |K_h(\zeta)|. \tag{3.6}$$

As we know that $\rho(K(i\xi)) \to 1$ for the time continuous
iteration when $|\xi| \to \infty$, we here want to study (3.6) as
a function of the time step h.

Example 3.1 Let us use backward Euler in (3.1) and
choose $\alpha = a$. Then the discrete equation, where $y_j \approx y(jh)$,
becomes

$$y_j^n = \frac{1}{1+h}\left(y_{j-1}^{n} + y_j^{n-1} - y_{j-1}^{n-1}\right), \qquad y_0^n = 0 \ \ \forall n. \tag{3.7}$$

For the first step we get the iteration

$$y_1^n = \frac{1}{1+h}\, y_1^{n-1}.$$

It converges to zero with the rate $1/(1+h)$, which approaches
1 as $h \to 0$. Thus mesh refinement would slow down
convergence. The maximum in (3.6) for backward Euler
can be easily computed as
$\max_\theta |a(e^{i\theta}) / (a(e^{i\theta}) + h\,b(e^{i\theta}))| \approx 1 - h/2$,
and is reached at $\theta = \pi$.
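The value quoted for backward Euler can be checked by sampling the transfer function on the unit circle. The sketch below assumes a(ζ) = ζ − 1 and b(ζ) = ζ with α = a in (3.5), and that the supremum in (3.6) is attained on |ζ| = 1; the function names and the sample step sizes are illustrative.

```python
import cmath
import math

def kh_backward_euler(theta, h):
    """|K_h(e^{i theta})| for backward Euler with alpha = a."""
    zeta = cmath.exp(1j * theta)
    a = zeta - 1.0
    b = zeta
    return abs(a / (a + h * b))

def max_on_unit_circle(f, h, samples=4000):
    """Largest sampled value of f(theta, h) over an even grid on (0, 2*pi)."""
    return max(f(2.0 * math.pi * k / samples, h) for k in range(1, samples))

if __name__ == "__main__":
    for h in (0.2, 0.1, 0.05):
        print(h, max_on_unit_circle(kh_backward_euler, h), 2.0 / (2.0 + h))
```

The sampled maximum coincides with 2/(2 + h), which is 1 − h/2 to first order, and it moves towards 1 as h is reduced.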


Let us now study $K_h(\zeta)$ given in (3.5). If $\alpha = a$ we can
write (assuming $a(\zeta) \ne 0$)

$$K_h(\zeta) = \frac{1}{1 + h\,\frac{b}{a}(\zeta)}.$$

One sees immediately that for instance with the trapezoidal
rule $|K_h(\zeta)|$ reaches the value 1 since $b(e^{i\pi}) = 0$ for this
method. As the connection between the Laplace variable
z and $\zeta$ is $\zeta = e^{zh}$ we may consider the term $h\,\frac{b}{a}(e^{i\xi h})$
for small values of $(\xi h)$ using the usual order conditions
for multistep methods, see e.g. [2]

$$\frac{a}{hb}(e^{i\xi h}) = i\xi\left(1 + c_p (i\xi h)^p + c_{p+1} (i\xi h)^{p+1} + \ldots\right) \tag{3.8}$$

where p is the order of the method (a, b) and $c_p$ the error
constant. Inverting (3.8) we get

$$h\,\frac{b}{a}(e^{i\xi h}) = \frac{1}{i\xi}\left(1 - c_p (i\xi h)^p + \ldots\right). \tag{3.9}$$

If $\alpha \ne a$ is used then it may be possible to improve the
rate of convergence since $K_h(\zeta)$ now becomes

$$K_h(\zeta) = \frac{\frac{\alpha}{a}(\zeta)}{1 + h\,\frac{b}{a}(\zeta)}. \tag{3.10}$$

— 7  and  5i  >  0  since  the  method  is  A-stable.  We  get  gain 
in  speed  from  1  -  h/2  to  1  -  h  but  we  lose  in  the  error 
constant  respectively. 

Second  order  methods  p  =  2 


If such $\alpha$ can be chosen that $|\frac{\alpha}{a}(e^{i\xi h})|$ is smaller than
1 then the absolute value of $K_h$ can be diminished. In
addition, $\alpha$ must be such that the order of the multistep
method is preserved, i.e. for small values of $(\xi h)$

$$\frac{\alpha}{a}(e^{i\xi h}) = 1 + \gamma_1 (i\xi h)^p + \gamma_2 (i\xi h)^{p+1} + \ldots. \tag{3.11}$$

Let  us  now  approximate  (3 .10)  for  first  order  methods. 
First  order  methods  p  =  1 

The  real  and  imaginary  parts  of  (3.9)  and  (3.11)  be¬ 
come 

£(e«*)  =  ( 1  -  T2(fh)2  +  . . .) 

a 

=  (6,/»  +...)  +  i(-i  +  62  fh"  +  •••). 

Notice  that  for  A-stable  methods  6i  must  be  nonnegative. 
So  letting  c  denote  a  positive  constant  we  can  approxi¬ 
mate 

'  "  l  +  26iA+ ^  +  0(A2) 

1 

~  1  +  cA  ’ 


Here (3. 11)  and (3.9)  become 

“(c*^*)  =  1  -  izUbf  +  ■  ■■+  f(-73(^A)^  +  •  ■  •) 
a 

+  . . .  +  ,(_i  +  +  ••■). 

where  -63  is  nonnegative  for  A-stable  methods.  So  we 
have 


|A'h(e‘^'‘)|" 


1  -2>T2({A)^  +  0((^A)^) 
1  -  263f2A3  +  ^  +  0(A2) 


and $|K_h|$ grows from 0 to $1/(1 + ch)$ as $\xi$ grows from 0
to $1/\sqrt{h}$ provided $\gamma_2 > 0$. Again what happens for large
values of $\xi$ depends on the stability region and $\alpha$. We give
some examples which show that it is possible to choose
$\alpha$ in such a way that $|K_h| \le 1/(1 + ch)$.

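Example 3.2 Let us use the BDF2 method and 4-step BDF2
as the filter: $\alpha(\zeta) = \tfrac12\, a(\zeta^2)/\zeta^2$, where
$a(\zeta) = \tfrac32\zeta^2 - 2\zeta + \tfrac12$ and $b(\zeta) = \zeta^2$.
The discrete iteration is now

$$y_j^n = \frac{1}{3/2 + h}\left(2y_{j-1}^{n} - \tfrac12 y_{j-2}^{n} + 3y_j^{n-1}/4 - y_{j-2}^{n-1} + y_{j-4}^{n-1}/4\right).$$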

It  is  straightforward  to  compute  that 


where  the  last  inequality  holds  if  61  >  0, 7?  -  272  < 
0  and  Id  <  What  happens  for  large  values  of  C 
depends  then  on  stability  region  and  a.  For  example  fw 
backward  Euler  (with  a  =  a)  we  have 

for  all  d 

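If one combines the filter $\alpha(\zeta) = \tfrac12(\zeta - \zeta^{-1})$ with backward
Euler then the constant multiplying h can be improved.
The discrete iteration (3.7) is modified to

$$y_j^n = \frac{1}{1+h}\left(y_{j-1}^{n} + (y_{j+1}^{n-1} - y_{j-1}^{n-1})/2\right), \qquad y_0^n = 0 \ \ \forall n.$$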
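The effect of this filter can be checked in the same way as for plain backward Euler. The sketch below assumes the transfer function (3.5) with a(ζ) = ζ − 1, b(ζ) = ζ and the central-difference filter α(ζ) = (ζ − 1/ζ)/2, sampled on the unit circle; function names and the sampling density are illustrative.

```python
import cmath
import math

def kh_unfiltered(theta, h):
    """|K_h| on the unit circle for backward Euler with alpha = a."""
    zeta = cmath.exp(1j * theta)
    return abs((zeta - 1.0) / ((zeta - 1.0) + h * zeta))

def kh_filtered(theta, h):
    """|K_h| on the unit circle with the filter alpha = (zeta - 1/zeta)/2."""
    zeta = cmath.exp(1j * theta)
    alpha = 0.5 * (zeta - 1.0 / zeta)
    return abs(alpha / ((zeta - 1.0) + h * zeta))

def max_on_unit_circle(f, h, samples=20000):
    """Largest sampled value of f(theta, h) over an even grid on (0, 2*pi)."""
    return max(f(2.0 * math.pi * k / samples, h) for k in range(1, samples))

if __name__ == "__main__":
    h = 0.1
    print("unfiltered:", max_on_unit_circle(kh_unfiltered, h))  # about 2/(2+h)
    print("filtered:  ", max_on_unit_circle(kh_filtered, h))    # about 1/(1+h)
```

Under these assumptions the filtered maximum is 1/(1 + h), attained where cos θ = 1/(1 + h), compared with 2/(2 + h) at θ = π without the filter.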


|£(e‘«)|2  =  1  _ 

0''  1  +  4  sin^  0 


which  implies  that  ||(e^^)|  <  1  always  and  especially 
for  small  ({A) 


|-(e'')lR.l-(CA)^ 

o 

We  may  also  compute  that 

^  1  +  A6/Q(e‘®)  ^  ~  1  +  cA 
so  that  we  may  conclude  that 


Maximization  in  ( 3 .6)  gives  now 


(e»»-e-»*)/2  _ 

I  -  1  +  he'*  '  "  1  +  A  ’ 


|ifh(e")|< 


1 

1  +  cA 


holds  for  all  6.  As  without  the  filter  we  have  at  ^  =  A^/^ 


and  the  maximum  is  obtained  at  a  small  value  of  0  with 
cosO=  j^.  The  constants  in  (3. 11)  satisfy  7,  -2yi  = 


*1  + A6/Q(e'<'')*~  l  +  cA^/z’ 
the acceleration is now of qualitative nature.

Example 3.3 A natural choice for a filter to be used with
the trapezoidal rule would be derivative approximation
over 3 time steps symmetrically with respect to the point
$h(j - 1/2)$:

$$\alpha(\zeta) = \tfrac13\left(\zeta^2 - \zeta^{-1}\right),$$

where $a(\zeta) = \zeta - 1$ and $b(\zeta) = \tfrac12(\zeta + 1)$ for the trapezoidal
rule. Here the discrete equation would be

$$y_j^n = \frac{1}{1 + h/2}\left((1 - h/2)\,y_{j-1}^{n} + (y_{j+1}^{n-1} - y_{j-2}^{n-1})/3\right).$$

One can compute the approximation

$$\frac{\alpha}{a}(e^{i\theta}) = 1 - \theta^2/3 + O(\theta^4)$$

for small $\theta$ so that this filter would seem to improve
convergence for small $(\xi h)$. Also

$$\left|\frac{\alpha}{a}(e^{i\theta})\right| = \tfrac13\,|2\cos\theta + 1| \le 1.$$

However, since $K_h(\zeta)$ is here

$$K_h(\zeta) = \frac{(\zeta^2 - \zeta^{-1})/3}{\zeta - 1 + h(\zeta + 1)/2},$$

we notice that $K_h(\zeta)$ is not bounded as $\zeta \to \infty$. The
reason is that the filter $\alpha$ is here such that when computing
the value $y_j^n \approx y^n(jh)$ we use the value $y_{j+1}^{n-1}$ from the
previous iteration. In practice this leads to difficulties in
determining the initial values for the sweeps.

One may, of course, consider other methods and construct
filters for them, e.g. Example 3.2 suggests that the
filter $\alpha(\zeta) = \tfrac12\, a(\zeta^2)/\zeta^k$ would work for BDF methods,
but we have here considered only the A-stable methods
which have order $p \le 2$.


4.  ACCELERATION 


As the convergence is quite slow, acceleration is important.
The following possibilities, at least in principle,
can be used.

(i) preprocess the initial guess

(ii) regularize the problem

(iii) play a game with gradual mesh refinement


(i) It was demonstrated at the end of Section 2 that if one
can prepare the initial guess in such a way that its derivatives
are correct up to order $l - 1$, then we can expect
convergence of the form $O(1/n^r)$ with $r \approx l/2 + 1/4$.
So all one has to do is to "solve" the problem over a tiny
interval and then extend this solution smoothly for larger
time values.

(ii) There are several ways to regularize a "nearly singular"
problem. For the present example a natural trick
would be to include an extra capacitor $C_1$, see Figure 2,
in the early iterations and let gradually $C_1 \to 0$.

(iii) A "tolerance game" has been analyzed in [9] for
"short window - superlinear convergence" and in [8] for
"long window - geometric convergence". The basic idea
is to balance the discretization error and the iteration error
while computing. Here this is rather easy as the convergence
is slow (i.e. $\rho \sim 1 - h$) and as we proceed, the
step size reductions occur less and less frequently. When
$\rho \approx \text{const} < 1$, the step size reductions take place repeatedly
and the problem arises whether one can reliably
make decisions on step size reductions without affecting
the actual rate of convergence. This has been analyzed
in [8] in detail.

The actual problem shows up in the interpolation of
coarse mesh couplings. It has been shown that there are
stable and reliable ways to interpolate at any order. Here
we can omit this problem.

Let us first estimate the amount of work needed to solve
a model problem with fixed time step h, when the order
of the method is p and we assume the model $\rho \sim 1 - h$.
Let $\varepsilon$ denote the tolerance we are interested in. Then one
sweep takes $1/h \sim (1/\varepsilon)^{1/p}$ time points. Now, as
$\rho \sim 1 - h$, we have with $1/h$ sweeps the error down by
a factor $1/e$. Thus

$$\frac{1}{h}\log\frac{1}{\varepsilon} \approx \left(\frac{1}{\varepsilon}\right)^{1/p} \log\frac{1}{\varepsilon}$$

sweeps will take the initial error of order 1 down to level
$\varepsilon$.

The total amount of work $W_0$ is proportional to the number
of sweeps divided by h, and thus:

$$W_0 \sim \left(\frac{1}{\varepsilon}\right)^{2/p} \log\frac{1}{\varepsilon}.$$

Let us compute a similar estimate with the mesh refinement.
Assume that (e.g.) we use the scale $h_j = e^{-j}$.
As the method is of the order p the discretization error is
proportional to $e^{-pj}$ and thus with step size $h_j$ we iterate
so long that the iteration error gets reduced by the factor
$e^{-p}$. As, by then, $\rho \sim 1 - e^{-j}$, this means $\sim p\,e^{j}$ sweeps
and the total work with step size $h_j$ is proportional to
$p\,e^{2j}$. We get down to the tolerance level when $e^{-pN} \sim \varepsilon$, i.e.
for $N \sim \frac{1}{p}\log\frac{1}{\varepsilon}$. Thus the total amount of work, say
$W_T$, satisfies

$$W_T \sim p\left(e^{2} + e^{4} + \ldots + e^{2N}\right) \sim \left(\frac{1}{\varepsilon}\right)^{2/p},$$

which means that the gain is a factor $\log\frac{1}{\varepsilon}$ over
$W_0$. This is the same gain as what one obtains in the
geometrically converging case [8].
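The two work estimates can be compared numerically. The sketch below is only a rough model under the assumptions used in the text (contraction factor about 1 − h, discretization error about h^p, refinement scale h_j = e^{-j}); the function names are hypothetical and constant factors are not meaningful.

```python
import math

def work_fixed_step(eps, p):
    """Work with a single step size h = eps^(1/p): sweeps times points per sweep."""
    h = eps ** (1.0 / p)
    sweeps = math.log(1.0 / eps) / h   # about (1/h) * log(1/eps) sweeps
    return sweeps / h                  # each sweep visits about 1/h time points

def work_mesh_refinement(eps, p):
    """Work when the step size is refined as h_j = e^{-j}, j = 1..N."""
    levels = math.ceil(math.log(1.0 / eps) / p)   # e^{-p N} reaches eps
    # Level j needs about p * e^j sweeps over about e^j time points.
    return sum(p * math.exp(2 * j) for j in range(1, levels + 1))

if __name__ == "__main__":
    p = 2
    for eps in (1e-4, 1e-8, 1e-12):
        ratio = work_fixed_step(eps, p) / work_mesh_refinement(eps, p)
        print(eps, ratio, ratio / math.log(1.0 / eps))
```

In this model the ratio of the two costs grows like log(1/ε), while the ratio divided by log(1/ε) stays within a fixed band, which is the gain claimed for gradual refinement.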
Abstract

In  this  paper  we  present  a  study  of  static  circuit  par¬ 
titioning  algorithms  for  waveform  relaxation  used  for 
circuit  simulation  on  a  multicomputer.  We  investigate 
the  important  tradeoff  between  the  irregularity  of  the 
partitioning  and  the  achievable  parallelism.  Also,  the 
importance  of  the  accuracy  in  certain  steps  in  the  par¬ 
titioning  calculation  has  been  studied.  Purely  topolog¬ 
ical  methods  are  compared  with  methods  that  try  to 
quantitatively  estimate  the  convergence  factor  of  the 
WR  iterations.  For  digital  CMOS  circuits  we  find  that 
only  the  close  neighborhood  of  a  circuit  node  (nearest 
neighbors)  influences  the  value  of  the  worst-case  cou¬ 
pling.  The  conductive  coupling  is  the  essential  part  to 
take  into  account  for  the  partitioning.  For  a  few  circuits 
(mostly  circuits  with  cross-coupling)  the  capacitive  cou¬ 
pling  must  also  be  included  in  the  partitioning.  It  is, 
however,  hard  to  find  an  exact  limit  for  the  convergence 
factor  estimate. 


1  Introduction 

Accurate timing verification of the large integrated circuits
that can be fabricated today is extremely CPU-time
consuming. For many circuits
the simulation run times have been reduced by
the use of logic simulators and switch-level simulators.
However, circuit designers still need circuit simulation
programs that do not trade their accuracy for speed,
both to simulate digital and analog circuits. One way
of shortening the circuit simulation run times, without
sacrificing the accuracy, is to use concurrent computers.
The traditional algorithms used for circuit simulation
include solving a large linear system.
This part of the simulation program accounts for 10 -
25 % of the total runtime, depending on the circuit size,
and is not easily parallelizable. Thus, when we want to
use more than 4 - 10 processors efficiently (according
to Amdahl's law) we have to turn to alternative algorithms.
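The processor counts quoted above follow from Amdahl's law: if a fraction s of the run time cannot be parallelized, the speedup on N processors is 1/(s + (1 − s)/N), which can never exceed 1/s. A minimal sketch, with a hypothetical function name and the 10 % and 25 % serial fractions taken from the text:

```python
def amdahl_speedup(serial_fraction, processors):
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

if __name__ == "__main__":
    for s in (0.10, 0.25):
        for n in (4, 10, 100):
            print(f"serial={s:.2f} processors={n:3d} speedup={amdahl_speedup(s, n):.2f}")
        print(f"serial={s:.2f} upper bound={1.0 / s:.1f}")
```

With a 25 % serial part the speedup is bounded by 4, and with a 10 % serial part by 10, so adding processors beyond roughly these counts yields little.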

One highly promising algorithm, which is used for circuit
simulation and also for other applications, is the
waveform relaxation (WR) method [4, 2]. This algorithm
requires that the equation system to be solved
is partitioned into subsystems. For circuit simulation,
partitioning the equation system is really the same as
partitioning the circuit to be simulated. The objective
of this study is to investigate several partitioning
algorithms and especially their impact on run times
and achievable parallelism for WR run on a multicomputer.
By multicomputer we mean a concurrent
message-passing computer that has a local memory
space for each processing node.


2  Waveform  Relaxation 

Waveform relaxation is an iterative method that can be
used for performing transient analysis. That is, it is a
method that can solve the equations formed by applying
Kirchhoff's current law (or Kirchhoff's voltage law, or both)
to the description of a circuit. These equations, representing
the circuit dynamics, form a system of ordinary
differential equations (ODEs). Such a system can be
partitioned into several subsystems each containing at
least one ODE. Using the waveform relaxation method,
one can solve these subsystems with only little interaction
between the subsystems. When solving one of the
subsystems all the other ones are relaxed, that is, the
solutions from these subsystems are assumed to be static.
This way, it is possible to solve for the state variables of
one subsystem over the total simulation interval without
interacting with the other subsystems. Thus, one
obtains functional approximations of all the state variables
over the simulation interval, that is, waveforms
for the state variables. Each global WR iteration consists
of one such solution round for all the subsystems.
Different iteration schemes may be used for the WR
iterations; the most common ones are Gauss-Seidel and
Jacobi iterations. The partitioning of the ODE system
is extremely important for the convergence speed of the
WR method and is thus important for the successful
use of the method.

Figure 1: Program flow for the waveform relaxation method.
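The iteration described above can be sketched on a toy problem. The example below is not CONCISE; it is a minimal Jacobi waveform relaxation for a hypothetical pair of weakly coupled linear ODEs, x1' = -2 x1 + 0.5 x2 + 1 and x2' = 0.5 x1 - 2 x2, each treated as its own subsystem and integrated with backward Euler over the whole window while the other waveform is frozen at its previous iterate. All coefficients, names and the step size are assumptions for illustration.

```python
def reference_backward_euler(n_steps, h):
    """Backward Euler applied to the coupled 2x2 system in one piece."""
    x1, x2 = [0.0], [0.0]
    # Each step solves (I - h*A) x_k = x_{k-1} + h*b by Cramer's rule.
    m11 = m22 = 1.0 + 2.0 * h
    m12 = m21 = -0.5 * h
    det = m11 * m22 - m12 * m21
    for _ in range(n_steps):
        r1 = x1[-1] + h * 1.0
        r2 = x2[-1]
        x1.append((r1 * m22 - m12 * r2) / det)
        x2.append((m11 * r2 - m21 * r1) / det)
    return x1, x2

def jacobi_wr(n_steps, h, sweeps):
    """Jacobi waveform relaxation: each subsystem sees the other's old waveform."""
    x1 = [0.0] * (n_steps + 1)   # initial guess: zero waveforms
    x2 = [0.0] * (n_steps + 1)
    for _ in range(sweeps):
        new1, new2 = [0.0], [0.0]
        for k in range(1, n_steps + 1):
            # Subsystem 1 uses x2 from the previous sweep, and vice versa.
            new1.append((new1[-1] + h * (0.5 * x2[k] + 1.0)) / (1.0 + 2.0 * h))
            new2.append((new2[-1] + h * (0.5 * x1[k])) / (1.0 + 2.0 * h))
        x1, x2 = new1, new2
    return x1, x2

if __name__ == "__main__":
    ref1, ref2 = reference_backward_euler(200, 0.01)
    for sweeps in (1, 5, 20):
        w1, w2 = jacobi_wr(200, 0.01, sweeps)
        err = max(max(abs(a - b) for a, b in zip(w1, ref1)),
                  max(abs(a - b) for a, b in zip(w2, ref2)))
        print(sweeps, err)
```

Because the two inner updates in a sweep are independent of each other, they could be assigned to different processors; with Gauss-Seidel ordering the second subsystem would instead use the freshly computed waveform of the first.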

In  a  practical  implementation  of  the  WR  method  one 
does  not  solve  for  the  total  simulation  interval  in  one 
iteration.  Such  an  approach  is  extremely  memory  con¬ 
suming.  Furthermore,  much  effort  is  spent  on  comput¬ 
ing  the  end  of  the  waveforms  which  do  not  contain  any 
information  during  the  first  iterations.  Instead,  the  to¬ 
tal  simulation  interval  is  divided  into  shorter  time  win¬ 
dows.  Each  such  time  window  is  then  solved  using  the 
WR  method.  In  our  program,  CONCISE,  the  time  win¬ 
dows  are  precomputed  from  the  input  waveforms.  This 
approach  helps  the  integration  algorithm  to  home  in  on 
discontinuities  in  the  state  variable  waveforms  [5].  The 
program  flow  for  the  WR  method  is  shown  in  Figure  1. 


3  Coupling  in  an  MOS  Circuit 

The objective of the partitioning methods used for the
WR method is to cluster equations that
are strongly coupled into the same subsystem. Thus,
the couplings between the resulting subsystems will be
loose. The couplings between neighboring circuit nodes in
the circuits are of two types, unidirectional and bidirectional
couplings. In MOS circuits, the unidirectional
couplings are the ones from gate to drain and gate to
source. The main bidirectional couplings are the ones
between drain and source. Furthermore, there are capacitive
bidirectional couplings between the terminals
of the MOS transistor. Parasitic capacitors and conductors,
which are due to the internal structures
in the integrated circuits, also cause bidirectional coupling.

Figure 2: Electrical examples of coupling. The MOS transistor
on the left exhibits unidirectional coupling from gate
to source whereas the conductor on the right is an example
of bidirectional coupling.
Widening our view to the entire circuit, we find that
there is yet another way to characterize coupling: the
coupling may be either local or global. Local coupling is
bidirectional coupling between neighboring circuit nodes.
The local coupling usually stems from the bidirectional
coupling as described above, but it could also be due
to two unidirectional couplings, one in each direction.
In an MOS circuit the latter type would correspond to
a pair of cross-coupled transistors. Global coupling is
due to a closed loop of unidirectional couplings. There
is also the less severe case of a non-closed chain of
unidirectional couplings.

In  this  study  we  have  concentrated  on  the  local  coupling 
stemming  from  bidirectional  couplings  between  circuit 
nodes.  The  local  coupling  due  to  unidirectional  cou¬ 
pling  is  easily  taken  care  of  by  a  routine  that  scans  the 
circuit  for  cross-coupled  transistors. 

The global coupling needs an entirely different approach
from the local. Usually the closed loops are found by
dataflow or graph algorithms [2]. Such a method is not
yet included in our program CONCISE. The non-closed
chains of unidirectional couplings are usually taken into
account when iteration schemes of Gauss-Seidel type
are used. The equations are then ordered according to
the directions of these unidirectional couplings. For the
experiments described in this paper we have used Jacobi
iterations and thus we have not considered this type of
coupling.


4  Partitioning  Methods 

The  partitioning  methods  used  for  circuit  simulation 
applications  can  be  divided  into  two  groups.  The 
first  group  contains  purely  topological  methods,  that  is 
methods  that  only  take  the  structure  of  the  circuit  into 
account  when  partitioning  the  circuit.  In  addition  to 
the  structure  of  the  circuit,  the  methods  in  the  second 
group  also  consider  quantitative  information,  for  exam¬ 
ple  the  values  and  sizes  of  components  in  the  circuit, 
when  partitioning  the  circuit. 

Topological  Methods 

Most of the methods in the first group are rather similar. The idea behind these methods is to cluster circuit nodes which at DC have a conducting path between them. For purely digital circuits without parasitic conductances this partitioning strategy is equivalent to clustering circuit nodes connected by the source
and  drain  of  an  MOS  transistor.  Thus,  in  this  paper 
we  call  our  version  of  this  method  the  source-drain  par¬ 
titioning  method.  Methods  of  this  DC-path  type  for 
the  WR  algorithm  have  been  described  in  [2,  3]  and 
the  circuit  partitioning  method  used  in  the  switch-level 
simulator  MOSSIM  [1]  is  also  similar.  These  methods 
are  motivated  by  the  unidirectionality  of  the  MOS  tran¬ 
sistor  and  are  therefore  mainly  suited  for  MOS  circuits. 
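The clustering idea described above can be sketched with a union-find structure. This is only an illustration under stated assumptions, not the CONCISE implementation: the function name, the `(gate, source, drain)` tuple layout, and the treatment of the supply rails `vdd` and `gnd` as fixed nodes that never merge clusters are all hypothetical choices made for the example.

```python
def source_drain_partition(nodes, transistors, supplies=("vdd", "gnd")):
    """Cluster circuit nodes joined by a transistor's source-drain channel.

    nodes       -- names of the unknown circuit nodes (supplies excluded)
    transistors -- iterable of (gate, source, drain) tuples (assumed layout)
    supplies    -- fixed-voltage rails; assumed never to merge clusters
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the cluster representative (path halving).
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for _gate, source, drain in transistors:
        # The gate couples only unidirectionally, so it never merges nodes.
        if source in supplies or drain in supplies:
            continue
        parent[find(source)] = find(drain)

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(sorted(c) for c in clusters.values())


# Two inverters with a pass transistor between nodes "a" and "b".
example = source_drain_partition(
    ["a", "b", "c"],
    [("in", "gnd", "a"), ("in", "vdd", "a"),   # first inverter drives a
     ("clk", "a", "b"),                        # pass transistor a <-> b
     ("b", "gnd", "c"), ("b", "vdd", "c")],    # second inverter drives c
)
print(example)
```

In this example only the pass transistor joins two unknown nodes, so `a` and `b` end up in one subnetwork while `c` stays alone.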


Quantitative  Methods 

The methods of the second group try to estimate how fast the WR iterations will converge. To explain how to arrive at such an estimate we must take a closer look at the WR method. The convergence proofs for the WR method show the method to be a contraction map in waveform space under certain conditions that are fairly easily fulfilled. That is, WR is shown to be a map $F$ on a waveform space $Y$ such that $F(Y) \subseteq Y$ and, for some norm,

$$\| F(x) - F(y) \| \le \gamma \, \| x - y \| \quad \text{for all } x, y \in Y, \qquad \gamma \in [0, 1[$$

holds. For such a contraction map the rate of convergence is defined by

$$\| y^k - y^* \| \le \gamma^k \, \| y^0 - y^* \|,$$

where $y^0$ is the initial waveform and $y^*$ is the fixed point of the iteration (that is, the solution). Consequently, $\gamma$ is called the convergence factor.

For the methods in the second group the goal is to compute, for each pair of circuit nodes, an estimate of the convergence factor that would result if these two nodes were limiting the convergence factor of the total circuit. The two circuit nodes are clustered if the estimate is higher than a user-specified threshold value. Usually, the estimate of the convergence factor is computed for the worst-case circuit state since it is impossible to know what the internal state of the circuit will be during the simulation. Thus, these methods are often overly pessimistic. A typical-case estimate would probably be more appropriate for a concurrent implementation of WR, where the price for large subsystems is high. However, such an estimate is hard to calculate since it is impossible to know what the typical state of the circuit is. Thus, we have stayed with the worst-case estimate in this investigation.

Figure 3: Simple 2-node circuit that exhibits bidirectional coupling.

5 Calculation of the Convergence Factor Estimate

Each pair of circuit nodes that exhibit bidirectional coupling may be viewed as a 2 x 2 system. The coupling of the circuit nodes is modeled as a nonlinear admittance connecting the circuit nodes. The impact of the rest of the circuit at each of the circuit nodes is lumped into the Norton equivalent admittance, which is the same as the driving point admittance.
For linear systems the convergence constant is the spectral radius of the iteration matrix. For both the Gauss-Seidel and the Jacobi iteration schemes this expression is of the same form, only differing in the square root present in the Jacobi case. Thus, we get the following expression as an estimate for the coupling

$$\gamma_{coup} = \left| \frac{Y_{across}}{Y_1 + Y_{across}} \right| \cdot \left| \frac{Y_{across}}{Y_2 + Y_{across}} \right| \tag{1}$$

By  using  the  largest  possible  value  for  the  admittance 
between  the  nodes  and  the  smallest  for  the  two  driving 
point  admittances  one  obtains  a  worst-case  value  for  the 
convergence  factor. 
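As a small numerical illustration of the estimate, the sketch below evaluates the product form (each factor is the across admittance divided by the sum of one driving point admittance and the across admittance) for real, positive admittances. The function names and the example values are assumptions made for this illustration. The helper `gauss_seidel_error_ratio` performs one Gauss-Seidel sweep on the 2 x 2 nodal system to show that the error shrinks by exactly this product, while the Jacobi factor is its square root.

```python
import math


def coupling_estimate(y1, y2, y_across):
    """Product-form coupling estimate for a 2 x 2 system."""
    return abs(y_across / (y1 + y_across)) * abs(y_across / (y2 + y_across))


def jacobi_factor(y1, y2, y_across):
    """Spectral radius of the Jacobi iteration matrix of the 2 x 2 system."""
    return math.sqrt(coupling_estimate(y1, y2, y_across))


def gauss_seidel_error_ratio(y1, y2, y_across):
    """Error reduction of one Gauss-Seidel sweep on the homogeneous system
    [[y1 + ya, -ya], [-ya, y2 + ya]] e = 0, starting from e2 = 1."""
    a1, a2 = y1 + y_across, y2 + y_across
    e2_old = 1.0
    e1 = y_across * e2_old / a1      # update node 1 from the old e2
    e2_new = y_across * e1 / a2      # update node 2 from the new e1
    return e2_new / e2_old


def should_cluster(y1_min, y2_min, y_across_max, threshold):
    """Worst case: largest across admittance, smallest driving point ones."""
    return coupling_estimate(y1_min, y2_min, y_across_max) > threshold


gamma = coupling_estimate(1.0, 3.0, 1.0)
print(gamma, jacobi_factor(1.0, 3.0, 1.0), gauss_seidel_error_ratio(1.0, 3.0, 1.0))
```

With driving point admittances 1 and 3 and an across admittance of 1, the two factors are 1/2 and 1/4, so the estimate is 0.125; a threshold of 0.1 would therefore cluster the two nodes.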


6 The Diagonal Dominance Norton Method


The above method of estimating the convergence factor was first described as the diagonal dominance Norton (DDN) partitioning in [11]. When one uses a difference approximation for the derivative, the approximation used in the Jacobian matrix for a linear admittance is

$$Y = G + \frac{\alpha_0}{h} C \tag{2}$$

where $\alpha_0 / h$ is the derivative operator consisting of $h$, the current time step, and $\alpha_0$, the zeroth coefficient of the integration formula. At DC ($h$ large), the conductance part dominates, but during transients ($h$ small) the capacitance will dominate the coupling. Thus, in [11] the coupling is estimated for the two extremes, firstly when the conductance is totally dominant (frequency is 0), and secondly when the capacitances are totally dominant (frequency is high). This way, the conductive coupling and the capacitive coupling are computed separately as

$$\gamma_{cond} = \frac{G_{across}}{G_1 + G_{across}} \cdot \frac{G_{across}}{G_2 + G_{across}} \tag{3}$$

and

$$\gamma_{cap} = \frac{C_{across}}{C_1 + C_{across}} \cdot \frac{C_{across}}{C_2 + C_{across}} \tag{4}$$

The conductive and capacitive clusterings are performed separately. The compound partitioning is the union of both partitionings, such that two nodes that belong to the same subnetwork in either of the two partitionings have to belong to the same subnetwork in the compound partitioning.
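The union rule can be sketched as follows. This is a minimal illustration, assuming each partitioning is given as a list of node-name clusters; the function name `compound_partition` is hypothetical and does not come from [11] or from CONCISE.

```python
def compound_partition(nodes, *partitionings):
    """Merge several partitionings: nodes that share a cluster in any
    one of them share a cluster in the result."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the cluster representative (path halving).
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for partitioning in partitionings:
        for cluster in partitioning:
            members = list(cluster)
            for other in members[1:]:
                parent[find(other)] = find(members[0])

    merged = {}
    for n in nodes:
        merged.setdefault(find(n), set()).add(n)
    return sorted(sorted(c) for c in merged.values())


conductive = [["a", "b"], ["c"], ["d"]]
capacitive = [["a"], ["b", "c"], ["d"]]
print(compound_partition(["a", "b", "c", "d"], conductive, capacitive))
```

Here `a` and `b` are joined by the conductive clustering and `b` and `c` by the capacitive one, so the compound partitioning places all three in one subnetwork and leaves `d` alone.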


As  mentioned  above,  the  internode  admittance  should 
be  maximized  and  the  node-to-ground  admittance  min¬ 
imized  to  find  the  worst-case  coupling.  Let  us  consider 
the  pure  conductance  coupling  as  in  equation  (3).  The 
minimum  conductance  between  source  and  drain  of  an 
MOS transistor is identical to zero. Thus, for circuits where the only conductive elements are source-drain connections, the coupling when conductance is dominating (at DC) will be identical to one. Consequently, the conductance part of the DDN algorithm is identical to the topological S-D method for circuits where the only conductances are those of the MOS transistors. Thus, the DDN method adds the effect of the capacitive coupling to the S-D method for such circuits.
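The step from zero minimum conductance to a coupling of one can be written out explicitly. Assuming the conductive estimate has the same product form as the general coupling estimate, with $G_1$ and $G_2$ the driving point conductances and $G_{across} > 0$ the source-drain conductance of the transistor joining the two nodes, setting the driving point conductances to their worst-case minimum of zero gives

```latex
\gamma_{cond}
  = \frac{G_{across}}{G_1 + G_{across}} \cdot \frac{G_{across}}{G_2 + G_{across}}
  \;\xrightarrow{\;G_1 = G_2 = 0\;}\;
  \frac{G_{across}}{G_{across}} \cdot \frac{G_{across}}{G_{across}} = 1 .
```

Since any threshold below one is exceeded, every source-drain connection leads to clustering, which is exactly the S-D rule.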


7 The Admittance Matrix Method


As  noted  in  [11],  the  DDN  algorithm  is  pessimistic  and 
may  give  unnecessarily  large  subnetworks.  We  have  al¬ 
ready  discussed  the  problem  of  using  the  worst-case  con¬ 
vergence  factor.  One  way  of  finding  a  more  realistic 
value  for  this  worst-case  coupling  is  to  use  the  small¬ 
est  and  largest  time  step  permitted  by  the  simulation 
program  for  the  two  extremes.  This  change  has  been 
suggested  in  [2].  Another  approach  is  to  use  the  intrin¬ 
sic  time  constant  of  the  components  to  find  the  highest 
possible  time  constant  of  the  signals  inside  the  circuit. 
We  have  chosen  the  latter  approach  for  our  admittance 
matrix  method. 


8 Computing the Driving Point Admittance


The computationally most expensive part of all methods that use the convergence factor estimate from equation 1 is to calculate the driving point admittance values, that is $G_1$ and $G_2$ in Figure 1. In [11] a recursive
depth-first  algorithm  for  this  calculation  is  given.  For  a 
ladder-type  circuit  this  algorithm  computes  the  correct 
admittance,  but  for  other  topologies  it  only  gives  an  ap¬ 
proximation  since  it  includes  each  node  only  once.  The 
authors  note  in  [11]  that  the  recursion  will  not  be  deep 
since  the  minimum  conductance  of  the  source  drain  con¬ 
nection  is  zero.  This  observation  holds  for  the  purely 
conductive  case,  but  when  capacitances  are  included  the 
recursion  will  continue  throughout  the  circuit. 

For the admittance method (AM) we have implemented an algorithm that gives the correct driving point admittance for a node in an arbitrary linear circuit. The circuit considered is the original one with the across-admittance removed and all other connections replaced by their minimum admittance. When $Y$ denotes the admittance matrix formulated by nodal analysis, the driving point admittance can be computed as

$$\frac{\det(Y)}{\det(Y_j)} \tag{5}$$

where $Y_j$ is the first-order cofactor of $Y$ with respect to row and column $j$ [6].

In our implementation the matrix size can be limited by only including circuit nodes within a limited number of neighbor hops from the original circuit node. We denote the resulting methods AM0 (0 hops), AM1 (1 hop), and so on to AM∞ (the entire circuit). The matrix method is computationally very expensive when the matrix is large, but by using this method we can get an impression of how important distant circuit nodes are to the resulting admittance value.

One interesting observation is that by taking more nodes into account (deeper recursion or larger matrix) when computing the driving point admittance, this admittance can only increase, not decrease. That is, the coupling will always decrease when we look further into the circuit. Thus, we can start our coupling calculation by computing an approximate value for the coupling. This we do by taking only the admittances connected directly to the node in question into account. If this approximate coupling is lower than the threshold for clustering, a thorough calculation of the driving point admittance is unnecessary. This way the computational requirements can be lowered both for the matrix method and the depth-first algorithm.
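The determinant ratio can be checked on a small resistive example. The sketch below is an illustration only, not the CONCISE code: the helper names (`det`, `nodal_matrix`, `driving_point_admittance`) and the example conductances are assumptions, and exact rational arithmetic is used so that the result can be compared with a hand calculation. The matrix $Y_j$ is taken to be $Y$ with row and column $j$ deleted.

```python
from fractions import Fraction


def det(matrix):
    """Determinant by Gaussian elimination with exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in matrix]
    n = len(m)
    result = Fraction(1)              # determinant of the empty matrix is 1
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            result = -result          # a row swap flips the sign
        result *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return result


def nodal_matrix(n, to_ground, branches):
    """Nodal admittance matrix from node-to-ground admittances and
    (node_i, node_j, admittance) branches between nodes."""
    y = [[Fraction(0)] * n for _ in range(n)]
    for i, g in enumerate(to_ground):
        y[i][i] += Fraction(g)
    for i, j, g in branches:
        g = Fraction(g)
        y[i][i] += g
        y[j][j] += g
        y[i][j] -= g
        y[j][i] -= g
    return y


def driving_point_admittance(y, j):
    """det(Y) divided by the determinant of Y without row and column j."""
    minor = [[v for c, v in enumerate(row) if c != j]
             for r, row in enumerate(y) if r != j]
    return det(y) / det(minor)


# Node 0: 1 S to ground; node 1: 3 S to ground; 2 S between the nodes.
y = nodal_matrix(2, [1, 3], [(0, 1, 2)])
print(driving_point_admittance(y, 0))   # 1 + (2 in series with 3) = 11/5
```

For this ladder the hand calculation gives 1 + (2·3)/(2+3) = 11/5 at node 0 and 3 + (2·1)/(2+1) = 11/3 at node 1, which is what the determinant ratio returns.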


9  The  Program 

The program used for the test, called CONCISE, is a circuit simulator for transient analysis of CMOS circuits. It is written in C and uses the Cosmic Environment/Reactive Kernel message-passing primitives [10]. These primitives support the programming model where each process has its own memory space. This model makes dynamic partitioning and load balancing expensive in CPU time, and thus the study was limited to static schemes where the partitioning and placement remain fixed throughout the computation. It is important to notice that the requirements on the partitioning algorithms in this case differ from the “traditional” parallelization where only a few processing nodes are used. The load balancing considerations become much more difficult when the number of processing nodes is about the same as the number of circuit nodes. The computer used in the study was a 64-node Symult s2010.


Table 1: Description of the circuits used for the experiments.

| circuit | trans | nodes | description |
|---------|-------|-------|-------------|
| adder | 262 | 105 | 4-bit-wide slice of an NMOS multiplier using the Booth algorithm. |
| contr | 701 | 355 | Control part of signal processing chip. This circuit is comprised of the circuits proc, inblock and sign. |
| dram | 793 | 535 | Dynamic RAM with 7-bit address and 3-bit data. |
| delay | 2844 | 1944 | 4 shift register delay lines of 128 stages each and 16 8-to-1 demultiplexers and 1 2-to-4 demultiplexer. |
| inblock | 221 | 119 | Mixed parts for the controller. A small PLA (3 inputs, 2 outputs), 2 full-adders, 5 latches, and 12 AND gates. |
| mult | 3134 | 1739 | 8x8 multiplier using the Booth algorithm. |
| pla | 1428 | 170 | Pseudo-NMOS PLA with 5 inputs, 64 outputs, and 32 rows. |
| proc | 240 | 92 | Finite state machine using pseudo-NMOS PLA with 10 inputs, 6 outputs, and 23 rows. 5 latches for the state. |
| reg | 1920 | 1152 | Two 64-bit wide shift registers with 64 latches between. |
| regpla | 3348 | 1322 | The circuits reg and pla combined. |
| ram | 209 | 122 | 4-bit NMOS RAM. Designed with hot-clock techniques. |
| ram2 | 1153 | 625 | 7-bit RAM with 7-bit address and 3-bit data. |
| rom | 414 | 240 | ROM with 7-bit address and 3-bit data. |
| sign | 240 | 144 | Two 8-bit wide shift registers (with parallel load) with 8 latches between them. |

Figure 4: Runs for adder circuit. The partitionings are pointwise partitioning (+), source-drain partitioning (X), and DDN partitioning depth 0 with γ = 0.1 (♦).


Table 2: The convergence factor threshold at which the capacitive partitioning starts merging nodes for various methods of computing the driving point admittance. DDN is the method proposed in [11] and AM is the admittance matrix method.

Value for γ_min for capacitive partitioning

| circuit | ddn0 | ddn∞ | am0 | am1 | am2 |
|---------|------|------|-----|-----|-----|
| adder | 0.13 | 0.12 | 0.13 | 0.11 | 0.11 |
| contr | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 |
| dram | 0.08 | 0.07 | 0.08 | 0.08 | 0.08 |
| delay | 0.04 | 0.04 | 0.04 | 0.03 | 0.03 |
| inblock | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 |
| mult | 0.10 | 0.09 | 0.10 | 0.10 | 0.10 |
| pla | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| proc | 0.009 | 0.009 | 0.009 | 0.009 | 0.009 |
| reg | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| regpla | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| ram | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 |
| ram2 | 0.07 | 0.07 | 0.07 | 0.07 | 0.07 |
| rom | 0.08 | 0.07 | 0.08 | 0.08 | 0.08 |
| sign | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |


10  Experiments 


The circuits used in the experiments are all digital MOS circuits. All in all there are 14 circuits. Two of them, adder and ram, are 4 µm NMOS circuits that employ hot-clock techniques [9]. The other twelve test examples are 2 µm CMOS circuits. All fourteen of them come from research chips designed either in Lund or at Caltech. The circuits are described in Table 1. The netlists for all the circuits were extracted from the chip layouts by the built-in extractor in the chip-design system Magic [8]. This extractor extracts internode capacitances, but not internode resistances. Furthermore, the extractor does not extract the size of the source and drain areas of the MOS transistors. These areas are needed since they define the sizes of the source-bulk and drain-bulk diodes. For our test circuits we have used the same size for all these diodes.

Figure 5: Runs for inblock circuit. The partitionings are pointwise partitioning (+), source-drain partitioning (X), and DDN partitioning depth 0 with γ = 0.02 (♦).

All the circuits have been simulated using CONCISE with pointwise partitioning (one ODE per subnetwork), source-drain partitioning, and with the different methods that also consider the capacitive coupling.

Capacitive  Coupling 

We would like to investigate the difference between the coupling values computed by the various methods which consider the capacitive coupling. To get a picture of the size of the capacitive couplings we decreased the convergence factor threshold in small steps (usually 0.01) until nodes were merged due to the capacitive coupling for these various methods. The resulting convergence factor thresholds are shown in Table 2. Comparing the resulting partitionings from the methods that consider the capacitive coupling, we find only small differences when we use the same γ. The major differences that can be found when different capacitive methods are used are due to the γ threshold used. Only minor differences are found due to the partitioning method used when the same threshold is used. Thus, we will only consider one method when investigating the capacitive coupling methods compared to the source-drain and pointwise partitioning methods.

Figure 6: Runs for sign circuit. The partitionings are pointwise partitioning (+), source-drain partitioning (X), and DDN partitioning depth 0 with γ = 0.01 (♦).

Table 3: Comparison between pointwise, source-drain (conductive), and DDN (conductive and capacitive) partitioning. The CPU time is for the transient analysis part run on 1 node except where stated otherwise.

| circuit | ckt nds | pointwise CPU time | pointwise iter/wind | source-drain CPU time | source-drain iter/wind | source-drain # of subs | source-drain max size | DDN0 γ | DDN0 CPU time | DDN0 iter/wind | DDN0 # of subs | DDN0 max size |
|---------|---------|------|-------|-------|------|-----|----|-------|-------|------|-----|----|
| adder | 105 | 8987 | 19.41 | 9650 | 9.55 | 50 | 20 | | 4890 | 4.20 | 26 | 24 |
| contr | 355 | 2294 | 13.12 | 6744 | 4.07 | 175 | 7 | | 8929 | 3.57 | 141 | 11 |
| delay * | 1944 | 39636 | 15.43 | 16886 | 3.07 | 632 | 9 | 0.04 | 14908 | 2.71 | 600 | 21 |
| dram | 535 | 5838 | 10.07 | 15223 | 5.74 | 118 | 50 | 0.07 | 13211 | 6.00 | 116 | 50 |
| inblock | 119 | 3659 | 9.69 | 1399 | 3.60 | 67 | 7 | 0.02 | 1389 | 3.51 | 66 | 8 |
| mult * | 1739 | 138852 | 18.06 | 65998 | 7.12 | 700 | 11 | 0.04 | 75851 | 6.54 | 537 | 13 |
| pla | 170 | 2622 | 4.00 | 2622 | 4.00 | 170 | 1 | 0.001 | 5272 | 4.00 | 115 | 2 |
| proc | 92 | 3949 | 10.31 | 1888 | 3.75 | 68 | 3 | 0.009 | 2649 | 3.38 | 62 | 6 |
| ram | 122 | 14539 | 22.79 | 2255 | 4.07 | 64 | 7 | 0.09 | 2310 | 4.04 | 62 | 7 |
| ram2 * | 625 | 32654 | 19.47 | 41311 | 8.50 | 119 | 72 | 0.05 | 36969 | 8.00 | 117 | 72 |
| reg * | 1152 | 23512 | 11.80 | 10438 | 2.86 | 325 | 5 | 0.01 | 12432 | 2.52 | 197 | 8 |
| regpla * | 1322 | 72590 | 19.64 | 23019 | 4.58 | 428 | 5 | 0.01 | 44949 | 5.30 | 300 | 8 |
| rom | 240 | 12400 | 19.45 | 4488 | 8.46 | 103 | 8 | 0.08 | 4245 | 7.92 | 102 | 8 |
| sign | 144 | 4431 | 11.88 | 2263 | 3.15 | 40 | 5 | 0.01 | 2896 | 2.18 | 10 | 64 |

* CPU time is the total for a multi-node run, according to one of the following notes:

1. CPU time is total for run in 8 nodes.
2. CPU time is total for run in 16 nodes.
3. CPU time is total for run in 4 nodes.
4. CPU time is total for run in 2 nodes.

Computing the Driving Point Admittance

From Table 2 it seems probable that the impact of faraway circuit nodes on the calculated coupling is small. To further investigate the impact of such distant circuit nodes on the calculated coupling we compared the exact values for the driving point admittance with approximate values. This we did for the circuits which we had found to have severe capacitive coupling and for the circuit nodes in these circuits where merging due to capacitive coupling occurred. The exact admittance values were calculated using the admittance matrix method with infinite matrix size (that is, including all of the circuit); this we call AM∞. The approximate values were computed using the AM method but with limited matrix size (AM0, AM1, and AM2). In all cases we came within 0.1 [...] when depth 1 was used. This means one needs only include the circuit node itself and its nearest neighbors in the driving point admittance calculation.

Run  Times  and  Convergence  Rate 

We would like to experimentally verify the positive effects on the convergence rates of the more sophisticated partitioning methods. There is no way to directly measure the convergence rate. However, the mean number of WR iterations needed per window gives a fairly good estimate of it*. From Table 3 we find that the S-D partitioning increases the convergence speed by a factor of 2-5.5 over pointwise partitioning. As expected, an increase in convergence speed gives a decrease in run time. The run time is, however, not always decreased by as large a factor, since the larger subnetworks resulting from source-drain partitioning take longer to evaluate than the trivial ones from the pointwise partitioning. This effect is even more pronounced when the capacitive effect is also taken into account (DDN method). Then the only substantial improvement in execution time is for the adder circuit.

*When convergence is slow, time windows are split to try to improve convergence. Thus, the convergence rate estimate is too low in these cases. This inaccuracy will make the improvements due to better partitioning methods look smaller than they really are.

Figure 7: Parallelism, that is, inverted speedup, for all the circuits for pointwise partitioning. The dotted line is the theoretical limit.

Figure 8: Inverted speedup for all the circuits for source-drain partitioning. The dotted line is the theoretical limit.

Figure 9: Inverted speedup for all the circuits for DDN partitioning. The dotted line is the theoretical limit.

Parallelism 

In Table 3 we find that for some circuits large subnetworks are obtained when S-D or DDN partitioning is used. The achievable parallelism is severely limited for these circuits, as can be seen when comparing the diagrams in Figures 7, 8, and 9. In these figures we show curves for all the test circuits together in order to be able to show all the data in limited space. The addition of the part that takes the capacitive coupling into account does increase the convergence speed slightly for most of the circuits, but the parallelism is reduced due to large subnetworks in some cases. In Table 3 we find that only for adder does the inclusion of the capacitive partitioning significantly improve the convergence speed and the execution time. To get a better understanding of the behavior we take a closer look at the curves for three of the circuits in Figures 4, 5, and 6. We also consult Table 3 to see the convergence rate. For adder we find that the S-D partitioning increases the convergence rate but spoils the parallelism. The addition of the capacitive part increases the convergence rate further and reduces the run time, but of course the parallelism is still poor. It is worth noticing that we had to try several γ thresholds to find the one (0.07) which gives this increase in convergence rate. For inblock the conductive part is the important one. The addition of the capacitive part makes no difference. For sign the addition of the capacitive part both increases the run time and ruins the parallelism. Due to lack of space we have left out the diagrams for the other circuits. These diagrams can be found in [7].

Speedup 

One interesting figure for programs like this one is speedup. The parallelism curves may look discouraging for the more sophisticated partitioning methods. We have compared CONCISE running the traditional direct circuit simulation method, where the whole circuit is treated as one large subnetwork, with CONCISE running the WR method. These comparisons may be seen in Figures 10 and 11. In these diagrams all the execution times for the WR method are normalized to the execution time of the direct method, that is, “the best sequential algorithm”. Thus, we find that the S-D method is not a bad choice, even if we could gain even more if the parallelism were not limited by large subnetworks for some of the circuits.

Figure 10: Inverted speedup for CONCISE using the pointwise partitioning normalized to the execution time for CONCISE’s direct method (the horizontal dotted line). The sloping dotted line is the ideal parallelization of the direct method.

Figure 11: Inverted speedup for CONCISE using the source-drain partitioning normalized to the execution time for CONCISE’s direct method (the horizontal dotted line).


11  Conclusions  and  Discussion 

The WR method using pointwise partitioning converges for all our test circuits. However, the convergence is slow for several of them. When the source-drain method or other methods that partition according to conductive coupling are used, the convergence speed is increased for all the test circuits. For a few of the circuits the capacitive coupling is also strong and partitioning that considers capacitive coupling is needed. Thus, we conclude that we need to employ both conductive and capacitive partitioning to get a reasonable convergence speed. For some circuits the partitioning will create large subnetworks, which severely reduce the achievable parallelism. For some circuits dynamic partitioning, which can use information about the current state of the circuit, will help in reducing the size of large subnetworks, but for others it will not. Thus, we need to be able to assign more than one node to perform the calculations for a large subnetwork.

Furthermore, the experiments show the difficulty in finding one fixed convergence factor threshold that is optimal for all circuits. The experiments indicate, however, that the admittance calculation need not be extended further than one hop, that is, it should include the circuit node in question and its neighbor circuit nodes. Thus, one can skip the deep recursions in the admittance calculations and use the saved CPU time for trying several threshold values when considering the capacitive coupling. This can prove useful when running on multicomputers, where an extremely large subnetwork may entirely spoil the achievable parallelism.
Abstract

The  most  time  consuming  operation  in  a  circuit  simu¬ 
lation  program  is  the  model  evaluation,  i.e.  the  compu¬ 
tation  of  the  coefficients  for  the  Jacobian  matrix.  Since 
these  coefficients  only  depend  on  the  node  voltages  and 
the  derivative  operator,  they  can  easily  be  computed 
in  parallel.  In  this  study  some  methods  for  computing 
the  Jacobian  matrix  are  discussed.  The  applicability  of 
the  methods  with  the  concurrent  waveform  relaxation 
method is also discussed. New results on the complexity of row-wise and column-wise model evaluation, and related stability problems are presented. Experimental results, with respect to efficiency and stability of the coefficient evaluation as well as parallel execution, are given. These experiments have been carried out with the CONCISE circuit simulation program on a Symult s2010 and on a Sequent Symmetry.


1  Introduction 

A  circuit  simulation  program  solves  a  system  of  nonlin¬ 
ear  ordinary  differential  equations  (ODEs).  Each  ODE 
is derived by means of nodal analysis. Thus, the sum
of  the  device  currents  entering  and  leaving  each  circuit 
node  is  equated  to  zero.  Each  current  contribution  is 
computed  by  evaluating  device  model  equations.  By 
using  the  node  voltages  one  can  compute  current  con¬ 
tributions  from  resistors,  capacitors,  transistors  etc. 

The system of nonlinear ODEs is traditionally solved by means of a difference approximation of the time derivative, Newton-Raphson iterations for linearization, and LU-factorization for solving the matrix equation in the innermost loop. This scheme is often referred to as a direct method or a method with global time-step [6], see figure 1.

Figure 1: Program flow for transient analysis.

Figure 2: Program flow for the waveform relaxation method.

Figure 3: Program flow for the hierarchical waveform relaxation method.

Concurrent circuit simulation programs often use an iterative method to split the simulation task into subtasks, or subsystems. With such a splitting, all subsystems become decoupled during an iteration and can all be solved in parallel. After each iteration, results are exchanged between subsystems. The iterations continue until consecutive results are sufficiently close, see figure 2.

Compared to the direct method, the iterative method solves the system many times, once for each iteration. However, with the iterative method the matrix inversion becomes very simple and some CPU time is gained this way. A further enhancement of the iterative method is to use waveform relaxations rather than time-point relaxations [1]. With the former method all iterations are performed on the functional level, that is, each ODE subsystem is solved for the entire simulation interval, or a time window, during an iteration. The result of such an iteration is a set of node voltage waveforms. Within each subsystem the integration method can optimize the time steps for its set of ODEs, which saves CPU time. Thus the waveform relaxation method is a multirate integration method.
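To make the scheme concrete, the following sketch applies Jacobi-type waveform relaxation to a linear two-node RC circuit and compares the result with a direct solution. It is an illustration under simplifying assumptions, not the CONCISE algorithm: the circuit, the parameter names (`C1`, `G1`, `Ga`, `I`), the zero initial waveforms, and the use of backward Euler with the same fixed step in both subsystems are choices made for the example (a real multirate implementation would let each subsystem choose its own steps).

```python
def direct_backward_euler(p, h, steps):
    """Direct method: backward Euler on the coupled two-node system."""
    a1 = p["C1"] / h + p["G1"] + p["Ga"]
    a2 = p["C2"] / h + p["G2"] + p["Ga"]
    det = a1 * a2 - p["Ga"] ** 2
    v1, v2 = [0.0], [0.0]
    for _ in range(steps):
        b1 = p["C1"] / h * v1[-1] + p["I"]
        b2 = p["C2"] / h * v2[-1]
        # Cramer's rule for [[a1, -Ga], [-Ga, a2]] [v1, v2]^T = [b1, b2]^T
        v1.append((b1 * a2 + p["Ga"] * b2) / det)
        v2.append((a1 * b2 + p["Ga"] * b1) / det)
    return v1, v2


def waveform_relaxation(p, h, steps, tol=1e-10, max_iter=200):
    """Jacobi waveform relaxation: each node is integrated over the whole
    window using the other node's waveform from the previous iteration."""
    a1 = p["C1"] / h + p["G1"] + p["Ga"]
    a2 = p["C2"] / h + p["G2"] + p["Ga"]
    v1 = [0.0] * (steps + 1)          # initial waveform guess
    v2 = [0.0] * (steps + 1)
    for iteration in range(1, max_iter + 1):
        new1, new2 = [0.0], [0.0]
        for n in range(1, steps + 1):
            # The two updates are independent and could run in parallel.
            new1.append((p["C1"] / h * new1[-1] + p["Ga"] * v2[n] + p["I"]) / a1)
            new2.append((p["C2"] / h * new2[-1] + p["Ga"] * v1[n]) / a2)
        change = max(max(abs(x - y) for x, y in zip(new1, v1)),
                     max(abs(x - y) for x, y in zip(new2, v2)))
        v1, v2 = new1, new2
        if change < tol:              # consecutive waveforms are close
            return v1, v2, iteration
    raise RuntimeError("waveform relaxation did not converge")


params = {"C1": 1.0, "C2": 1.0, "G1": 1.0, "G2": 1.0, "Ga": 1.0, "I": 1.0}
d1, d2 = direct_backward_euler(params, 0.01, 100)
w1, w2, iterations = waveform_relaxation(params, 0.01, 100)
print(iterations, abs(d1[-1] - w1[-1]), abs(d2[-1] - w2[-1]))
```

Because both solvers use the same discretization here, the converged waveforms agree with the direct solution up to the iteration tolerance; the price is that the window is integrated once per iteration.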

The combination of multirate integration and simple matrix inversion makes the iterative method comparable in performance to the direct method [4]. In addition it is fairly straightforward to modify the waveform relaxation method to run on a multicomputer [2]. However, special attention has to be paid to the convergence rate of the waveform iterations and the size of the subsystems, as slow convergence may lead to excessively long execution times and large subsystems limit the concurrency.

Ideally a subsystem should correspond to one circuit equation, that is, a circuit node. In this case, as has been demonstrated with the circuit simulation program CONCISE [2, 3], the available speedup is a fair fraction of the number of circuit nodes. However, if some circuit nodes are strongly coupled, the corresponding equations have to be solved together with a direct method or convergence will be slow [4]. When some smaller subsystems are gathered into one larger subsystem, an equation block, such a subsystem may become much larger than the others. This will result in load imbalance as only one computing node is used per equation block.

It is not always possible to avoid large subsystems. In order to prevent a large subsystem from becoming a bottleneck, it is desirable to be able to allocate several computing nodes to the solution of such a large subsystem, see figure 3. This problem is similar to the problem of solving the entire ODE system with a concurrent direct method. The difference is that the largest subsystem typically is much smaller than the entire system if the system itself is large. As has been pointed out earlier [3], there is much less concurrency in a distributed direct method compared to the waveform relaxation method. However, for solving subtasks a distributed direct method may be attractive.

When augmenting the concurrent waveform relaxation method to use two levels of concurrency - parallel evaluation of subsystems and parallel evaluation within subsystems - one must address the partitioning problem as well as decide how much of the direct subsystem solver should be distributed. This study deals with the latter problem; the first problem will be the subject of a future study. Furthermore, as we only consider subsystems of limited size, say below 100 equations, we are primarily interested in an efficient device model evaluation scheme and not a distributed LU-factorization algorithm.

2 Computing the Jacobian Matrix

When a subsystem, or block, is evaluated, the computation of the Jacobian matrix coefficients dominates the CPU usage. By parallelizing the device model evaluation, one may obtain a better overall speedup.

The coefficients of the Jacobian are partial derivatives of node currents (function values) with respect to node voltages. These derivatives can be computed by means of a difference approximation, with row-wise or column-wise perturbations, or they can be derived analytically.

When the function values (node currents) stem from a multivalued function, that is, when all function values are computed in a single routine, the column-wise perturbation requires fewer function calls than the row-wise method. The former is $O(n)$ and the latter is $O(n^2)$. However, in a circuit simulation program, the Jacobian matrix is derived by computing device transfer admittances. Thus, the coefficients are computed by summing contributions from devices. If a single coefficient is to be computed, only the devices connected to the circuit node corresponding to the coefficient in question need to be evaluated. Thus, row-wise perturbation cannot be discarded without further analysis.
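The difference in call counts for a multivalued function can be made concrete with a small sketch. It assumes forward differences and a generic vector function; `Counted`, `example_currents` and the step size `h` are hypothetical names and values chosen for illustration, not part of CONCISE.

```python
import numpy as np

class Counted:
    """Wraps a function and counts how often it is called."""
    def __init__(self, f):
        self.f, self.calls = f, 0

    def __call__(self, x):
        self.calls += 1
        return self.f(x)

def jacobian_columnwise(f, x, h=1e-6):
    """Perturb one variable at a time; each call to f yields a whole column."""
    n = len(x)
    f0 = f(x)                          # 1 nominal call
    J = np.empty((n, n))
    for j in range(n):                 # n displaced calls
        xp = x.copy()
        xp[j] += h
        J[:, j] = (f(xp) - f0) / h
    return J

def jacobian_rowwise(f, x, h=1e-6):
    """Build the Jacobian one row (one function value) at a time.

    With a multivalued f every entry costs a full call, of which only
    component i is used."""
    n = len(x)
    J = np.empty((n, n))
    for i in range(n):
        fi0 = f(x)[i]                  # nominal value for row i
        for j in range(n):
            xp = x.copy()
            xp[j] += h
            J[i, j] = (f(xp)[i] - fi0) / h
    return J

def example_currents(v):
    """Hypothetical multivalued function standing in for the node currents."""
    return np.array([v[0] ** 2 + v[1], np.sin(v[1]) * v[2], v[0] * v[2]])
```

For n variables the column-wise routine makes n + 1 calls and the row-wise routine n + n^2 calls, while both produce the same difference approximation. The picture changes once each row only needs the devices attached to its own node, which is the case analyzed in the following subsections.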

Definitions  and  derivations 

Let n be the rank of the Jacobian matrix, that is, the number of circuit nodes in a particular subsystem. Also, let k be the number of devices connected to circuit nodes in this particular subsystem. Finally, let b be the average number of terminals each device has connected to other circuit nodes inside the subsystem.

From these three basic parameters we can derive two additional interesting characteristics. First we define the density d, that is, the average number of devices per circuit node, as

$$d = \frac{kb}{n} \qquad (1)$$

Then we define m as the number of nonzero entries in the Jacobian matrix per row. As devices may be connected in parallel it is clear that m ≤ d.

For electrical circuits, each device only has a limited number of terminals. Furthermore, most circuit nodes are only connected with a few neighbor nodes - local interaction. Thus the Jacobian is typically very sparse, and b is in the range of 3-5 for large subsystems. It also follows that k is proportional to n.

Row-wise  calculation 

Let $C_{row}$ be the number of device model evaluations for the row-wise calculation used in CONCISE. Then we need

$$C_{row} = n(1 + m)d \qquad (2)$$

model evaluations to compute the Jacobian matrix. First, n · d evaluations are needed to calculate the nominal values for all the circuit nodes. Then, to get all the partial derivatives we need to displace each of the other circuit node voltages for each row and recalculate them. Each such calculation costs d and there are m of them per row and we have n rows.

$$C_{row} = (n + nm)d$$

The above expression can be simplified by inserting the definition of d and using the worst-case value for m

$$C_{row} \le n\left(1 + \frac{kb}{n}\right)\frac{kb}{n} = kb\left(1 + \frac{kb}{n}\right) \qquad (3)$$

Column-wise  calculation 

With a general column-wise calculation method, the number of device evaluations becomes

$$C_{col} = k(1 + n). \qquad (4)$$

First k device evaluations are performed to get the nominal values. Then we need one function evaluation per circuit node (n) to get the displaced values for the partial derivatives. For each of these function evaluations we need k device evaluations.

However, to get each displaced circuit node value we only need to evaluate the components which are connected directly to the circuit node in question and to its neighbor circuit nodes*. Thus, firstly we get n times the cost for each circuit node. Secondly, for each circuit node we take all its neighbors, m - 1, and evaluate all their device models, which are d each. This is obviously the worst case since some of these components generally are the same and these components only need to be evaluated once. In addition to these components we also need to evaluate the components which only contribute to the diagonal for the circuit node in question. The number of these “diagonal-only” elements per circuit node is called j, with j ≤ d - m. Thus, we get a worst-case expression for the number of device evaluations in this case as

$$C_{col} = k + n((m - 1)d + j) \qquad (5)$$

After insertion of the worst-case value for j and simplification we get

$$C_{col} \le k + nm(d - 1) \qquad (6)$$

* The neighbor devices contribute to the total current in the present node.
As previously mentioned an electrical device only has a small number of terminals. Thus, a device typically has 1-3 ungrounded terminals, and consequently there are 2-3 devices per circuit node. By letting the number of devices depend on the number of circuit nodes, k = an, we can compare $C_{row}$ with $C_{col}$. Thus, by using the worst-case expressions for $C_{row}$ and $C_{col}$ we get

$$\frac{C_{col}}{C_{row}} = \frac{n((ab)^2 + a(1 - b))}{n((ab)^2 + ab)} = 1 - \frac{2b - 1}{b(1 + ab)} \qquad (7)$$

As $2b \gg 1$ we can simplify this further to

$$\frac{C_{col}}{C_{row}} \approx 1 - \frac{2}{1 + ab} \qquad (8)$$
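The worst-case counts and their ratio are easy to check numerically. The sketch below takes the density to be d = kb/n, which is what equation (3) implies, and uses the worst-case values m = d and j = d - m. The function names and the sample values of n, a and b are hypothetical.

```python
def c_row_worst(n, k, b):
    """Worst-case row-wise count, equation (2) with m = d."""
    d = k * b / n              # density: devices per circuit node
    m = d                      # worst case for nonzeros per row
    return n * (1 + m) * d     # equals k*b*(1 + k*b/n), equation (3)

def c_col_worst(n, k, b):
    """Worst-case column-wise count, equation (5) with m = d and j = d - m."""
    d = k * b / n
    m = d
    j = d - m                  # worst case for diagonal-only devices
    return k + n * ((m - 1) * d + j)   # equals k + n*m*(d - 1), equation (6)

def ratio_closed_form(a, b):
    """Equation (7), with k = a*n."""
    return 1 - (2 * b - 1) / (b * (1 + a * b))

def ratio_approx(a, b):
    """Equation (8), valid when 2b is much larger than 1."""
    return 1 - 2 / (1 + a * b)

if __name__ == "__main__":
    n, a, b = 100, 2.5, 4      # sample values, chosen only for illustration
    k = a * n
    print(c_col_worst(n, k, b) / c_row_worst(n, k, b))
    print(ratio_closed_form(a, b), ratio_approx(a, b))
```

The ratio does not depend on n, since both counts grow linearly with the number of circuit nodes once k = an is inserted.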


The row-wise and column-wise perturbation methods for coefficient evaluation have been tested in CONCISE, where the comparison was done in a single computing node. On a variety of circuits, both methods perform roughly the same, see table 1.

Differences between the methods in terms of Newton-Raphson iterations and time-points are due to the fact that the perturbation is computed row-wise and column-wise, yielding slightly different Jacobian matrices.

It is clear from the experimental data that row-wise and column-wise coefficient evaluation perform roughly the same. The row-wise evaluation scheme has a great advantage in that the data structure is much simpler. With this method it is only necessary to know which devices are connected to each node, and not to keep track of neighbor nodes as in the column-wise scheme.


Both coefficient evaluation schemes are now $O(n)$. However, it is clear that the column-wise perturbation scheme still requires fewer device model evaluations, typically 70-85% of the number of row-wise evaluations. On the other hand, with the column-wise method, each device model evaluation is more complex as all terminal currents must be computed. With the row-wise scheme, only the currents for the terminal connected to the node in question need to be evaluated. Thus, the performance of both methods should be comparable.


3  Stability  problems 

Because of symmetry in some device models, a difference approximation of the device derivatives sometimes yields a singular device Jacobian†. This is, for example, true for the MOS transistor. The MOS device is symmetrical with respect to drain and source. For an n-channel device the most positive (highest voltage) one of the drain and the source terminals becomes the “de facto” drain terminal. Thus, if the drain to source voltage is smaller than the perturbation in the difference approximation, the source and drain of the device may change place during the derivative computation. When the terminals are changed, the same equations are used for both the source and the drain, yielding a singular device Jacobian. In the scalar case, when the rank of the Jacobian is 1 and only one device terminal is evaluated, the symmetry of the device causes no problem.

† Or near singular because of the limited numerical accuracy.

Table 3: Rejected time-steps for the fifo2 circuit

This stability problem has been experimentally verified on a CMOS circuit. In table 2 results are shown from a run with a test circuit causing severe problems for both perturbation methods. The number of Newton-Raphson iterations and time-points are comparable, but the CPU-times differ.

In table 2, the problem only shows up in the CPU-time column. By tabulating the rejected number of Newton-Raphson iterations and time-points, table 3, and comparing these with the number of accepted iterations and steps, the stability problems are more obvious.

It is interesting to note that, in spite of the problems with a near singular Jacobian, the waveform relaxation still converged, if slowly.

Analytical  derivatives 

Analytical derivatives require the fewest number of device model evaluations, but make device equations more complex and the unpacking of data more difficult. The number of parallelizable operations is also smaller compared to the perturbation methods. With analytical derivatives, a device is evaluated only once per iteration. With row-wise or column-wise computation, a device is evaluated once for each entry in the Jacobian, and once for each entry in the right hand side of the equation system. Thus, more, and smaller, tasks are available for parallel evaluation with the perturbation methods.

                     CPU (s) by number of coeff. eval. nodes
Circuit     rank        1        2        4        8
add          106     4500     2380     1300     1140
bufhot         8     80.0     45.0     35.7     37.8
dflip         19      280      147     83.1     83.5
jc_add       105     1540      981      982      987
jc_dflip      19     40.6     28.3     28.8     29.8
jc_ram       122     1460     1150     1160     1160
jc_two         7     33.5     27.2     28.2     30.1

Table 4: Run time with distributed Jacobian computation

In the above tables a column for analytical derivatives has been added. Although the number of device evaluations is smaller than for the other methods, the required CPU-time is not much less. This is due to the fact that one device evaluation is much more complex than with the other schemes. In fact, for the MOS device the code is 5-10 times longer with analytical derivatives. Thus numerical derivation would be preferable if the stability problems mentioned above could be avoided.

A second problem with an analytical derivative computation is the fact that the results of the device evaluations have to be communicated in messages containing the entire device Jacobian. This Jacobian can have a rank from one to four for electrical circuits, and much higher for chemical “devices” [5]. Since the packing and unpacking of this data adds to the sequential fraction [3] of the subsystem evaluation, this scheme appears to be less attractive than, especially, the row-wise perturbation method‡.

‡ Given that the stability problems of the perturbation method can be eliminated.

4 Experimental results and further work

Row-wise difference approximation of the subsystem Jacobian matrix has been implemented in CONCISE. This version does not include the entire hierarchical waveform relaxation as depicted in figure 3, as we cannot presently partition circuits hierarchically. The test program is, however, ready to perform a hierarchical simulation once the partitioning part is ready.

Figure 4: Inverted speedup with distributed Jacobian computation (horizontal axis: N, number of nodes)


                     CPU (s) by number of coeff. eval. nodes
Circuit     rank        1        2        4        8
add          106     3680     1940     1090      810
bufhot         8     60.0     38.6     31.3     32.2
dflip         19      274      157     95.1     71.2
ram          123     2480     1350      854      667
two            7     24.3     15.3     12.0        ¹
jc_add       105      963      579      406      333
jc_dflip      19     25.9     16.6     13.9     14.2
jc_ram       122      860      539      434      440
jc_two         7     19.7     14.5     14.4        ¹

¹ No result as rank < number of nodes

Table 5: Run time with distributed integration algorithm and Jacobian computation


Figure 5: Inverted speedup with distributed integration algorithm and Jacobian computation (horizontal axis: N, number of nodes)


Results of runs on a Symult s2010 are shown in table 4. The table shows the CPU-time for various test circuits when 1, 2, 4, and 8 computing nodes have been used for evaluating the Jacobian matrix.

The data in table 4 and figure 4 show that a speedup of four can be achieved. This result is with accurate device models (complex MOS model and diffusion diodes). With very simple models (simple MOS model and linear capacitors, circuits with jc_ prefix) the speedup is limited to less than a factor of two. This confirms predictions in, e.g., [3].

The speedup does not depend strongly on the rank of the Jacobian, but more on what device models are being used. Thus, the speedup is limited by coefficient unpacking, matrix inversion, and the numerical integration in this case.
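The speedups quoted above follow directly from the run times in table 4. The short sketch below recomputes them as the ratio of the single-node time to the N-node time; the circuit names are written with the jc_ prefix normalized, and the helper names are hypothetical.

```python
# CPU times in seconds from table 4, for 1, 2, 4 and 8 computing nodes.
NODES = (1, 2, 4, 8)
table4 = {
    "add":      (4500, 2380, 1300, 1140),
    "bufhot":   (80.0, 45.0, 35.7, 37.8),
    "dflip":    (280, 147, 83.1, 83.5),
    "jc_add":   (1540, 981, 982, 987),
    "jc_dflip": (40.6, 28.3, 28.8, 29.8),
    "jc_ram":   (1460, 1150, 1160, 1160),
    "jc_two":   (33.5, 27.2, 28.2, 30.1),
}

def speedups(times):
    """Speedup relative to the single-node run, one value per node count."""
    return [times[0] / t for t in times]

def best_speedup(times):
    """Return (node count, speedup) for the fastest configuration."""
    s = speedups(times)
    i = max(range(len(s)), key=s.__getitem__)
    return NODES[i], s[i]

if __name__ == "__main__":
    for name, times in table4.items():
        nodes, s = best_speedup(times)
        print(f"{name:9s} best speedup {s:.2f} on {nodes} nodes")
```

The inverted speedup plotted in figure 4 is the reciprocal of these values.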

To verify this, the node voltage prediction, companion source computation, and integration error estimation were also distributed. These computations were spread out evenly over all the evaluation nodes. Thus a node program is responsible for solving the matrix equation, sending tasks to the device model evaluation nodes, and for controlling the integration algorithm.

With this approach data distribution is rather simple, but all nodes need to exchange node voltages and companion source values. The local truncation error estimations are sent to the computing node responsible for the integration error control and step length computation. A more efficient Jacobian matrix unpacking scheme, using precomputed pointers rather than search methods, was also employed.
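The difference between searching and using precomputed pointers during unpacking can be sketched as follows. The actual storage layout in CONCISE is not described here, so the sketch assumes a CSR-like sparse pattern; all function names and the layout itself are hypothetical.

```python
def build_sparse_pattern(n, entries):
    """CSR-like pattern: for each row, a sorted list of its column indices."""
    cols = [sorted({c for r, c in entries if r == row}) for row in range(n)]
    row_start = [0]
    for row_cols in cols:
        row_start.append(row_start[-1] + len(row_cols))
    col_index = [c for row_cols in cols for c in row_cols]
    return row_start, col_index

def find_slot(row_start, col_index, r, c):
    """Search method: scan row r for column c every time a coefficient arrives."""
    for p in range(row_start[r], row_start[r + 1]):
        if col_index[p] == c:
            return p
    raise KeyError((r, c))

def precompute_pointers(row_start, col_index, entries):
    """Done once per circuit: one slot index per incoming coefficient."""
    return [find_slot(row_start, col_index, r, c) for r, c in entries]

def unpack(values, pointers, coefficients):
    """Done every iteration: accumulate coefficients without any searching."""
    for p, v in zip(pointers, coefficients):
        values[p] += v
```

Since the order of incoming coefficients is fixed by the circuit topology, the search cost is paid once and every later unpacking is a single pass over the message.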

Somewhat better speedups are achieved by these modifications, as can be seen in table 5 and figure 5. With the dflip circuit, some 15% of the total CPU-time of the node program was spent doing matrix inversions and another 15% computing the Newton-Raphson iteration errors. Approximately 30% of the CPU-time was spent on packing and unpacking data. At this stage it seems to be more important to minimize the amount of data that has to be sent around than to distribute the LU-factorization.

Our results show that it is possible to combine a distributed direct ODE-solver and the waveform relaxation method. The speedup from the subsystem solver is not as good as for the waveform relaxation part. Thus, although useful, the direct method is not a final solution to the waveform relaxation load imbalance problem. Dynamic partitioning is probably needed to further enhance performance.